MANAGING EDITOR’S NOTE: The following is from an oral presentation given on July 20, 1999 during the Fourth IAMSE Association Meeting held at Georgetown University School of Medicine in Washington, DC.
I will first address the current status of Computer-Based Testing (CBT) for United States Medical Licensing Examination (USMLE) Step 1, which is always of interest to audiences like this, and second will describe the testing software that we are developing at NBME which has the code name FRED. From the outset, let me state that this word is not an acronym with a definable meaning. In fact, it has absolutely no significance at all other than being a convenient reference name for the program. Third, I would like to speak about the National Board of Medical Examiners (NBME) Subject Tests since we are about to computerize those also. I believe this is an area where your input will be crucial. With the USMLE, we are fairly constrained in what changes can be made since this is an exam which must satisfy state licensing boards. However, there is much more latitude with the Subject Test program since it is essentially designed as a service to medical schools. With your comments and suggestions, it could be made even more helpful.
CBT for USMLE
We will begin with progress in the implementation of the USMLE Step 1. To reiterate, this involves seven 50-item blocks taken over an eight-hour period. Virtually all items are in single-best-answer format (A-type); there are typically five options, although the number can vary from three up to a dozen or more. Content coverage is parallel in each block. Originally, we had planned to do adaptive testing because it would help us retain as much accuracy and reliability as possible, given that we were shortening the exam. However, we eventually decided for various reasons to delay adaptive testing, and currently each block is equivalent in mean item difficulty. There will be about 18,000 U.S. medical students taking the USMLE Step 1 during 1999. To date, over 17,000 of these have been tested. To summarize progress so far, there has been a deafening silence from medical schools in terms of problems. We assume that means everything is functioning more or less as planned. The rate of reported problems is approximately 0.5%. Whether that is good or bad depends on whether you are one of the small number of students who encountered a problem.
Problems in the Examination Process
It is important to consider just what types of problems occur. For convenience, these are split into three groups. The first group has been software time-out problems. For the USMLE Step 1, the session clock allows seven hours of testing in an eight-hour period. However, if there is a power failure, we have found that the session timer may not stop, in which case students may be “shorted” time when the computer is rebooted. Similarly, on a few occasions, Sylvan personnel, in trying to be helpful, turned on workstations before students had logged in to start the exam. Since this could result in the timer starting, some students were again timed out. It is NBME policy (for many complex reasons) that students may not restart the same test on a different day. Students who had timed out because of this problem were required to sit the test again. Not surprisingly, they were very unhappy about this. This timing software “bug” has now been remedied, and fortunately, in the grand scheme of things, this problem had a serious impact on only a few students.
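One way a timer of this kind can avoid shorting examinees is to count only time actually spent in the session and periodically checkpoint the elapsed total to disk. The sketch below is purely illustrative; the file name, class, and method names are my own assumptions, not a description of the actual NBME or Sylvan software.

```python
import json
import os
import time

CHECKPOINT = "session_timer.json"  # hypothetical checkpoint file


class SessionTimer:
    """Counts only time actually spent in the exam session.

    Elapsed time is checkpointed to disk, so a power failure or
    reboot does not keep the clock running while the workstation
    is down: on restart, the timer resumes from the saved total.
    """

    def __init__(self, limit_seconds):
        self.limit = limit_seconds
        self.elapsed = 0.0
        self._started = None
        # Resume from a previous checkpoint if one exists.
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                self.elapsed = json.load(f)["elapsed"]

    def start(self):
        # Use a monotonic clock so wall-clock changes cannot skew timing.
        self._started = time.monotonic()

    def checkpoint(self):
        # Fold running time into the saved total and persist it.
        if self._started is not None:
            now = time.monotonic()
            self.elapsed += now - self._started
            self._started = now
        with open(CHECKPOINT, "w") as f:
            json.dump({"elapsed": self.elapsed}, f)

    def remaining(self):
        running = 0.0
        if self._started is not None:
            running = time.monotonic() - self._started
        return max(0.0, self.limit - self.elapsed - running)
```

The key design point is that the timer only ever advances while the session is genuinely running; a crash between checkpoints costs the examinee at most the un-checkpointed interval rather than the entire downtime.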
The second group relates to the occasional occurrence of scheduling problems. In one incident, we received a phone call from an anxious candidate who thought he had been scheduled to take the exam in East Chicago when in actuality he was scheduled for West Chicago. This candidate suffered the anxiety of navigating Chicago rush-hour traffic to arrive barely in time to begin the exam. There have been a few similar incidents.
Perhaps the largest number of diverse problems has occurred in the third group, which relates to the quality of the testing experience. These are problems such as the room environment being too warm or too noisy, the proctor being rude, or the bathrooms being dirty. Some of these issues are real, but our difficulty is that we cannot know for certain after the fact how warm it was, how noisy it was, or what constitutes an unreasonable problem with the exam environment. Many anecdotes have been circulated within the examinee community. There was, for example, an individual who complained that the computer was bouncing up and down on the desk because there was a jackhammer in operation next door. We checked into this and, sure enough, there was a jackhammer in operation. There was another candidate who, for reasons inexplicable (at least to us), removed his clothes down to his underpants while he was taking the exam! While he did not actually complain, another candidate in the center did. There are many such interesting and sometimes humorous stories. Our main purpose in investigating these incidents is to be certain that they did not substantially affect candidate performance. Obviously, some of this is in the eye of the beholder. For example, many will remember that in the early days of preparing for CBT there was enormous reaction to the concept of examinees being unable to return to questions and change their answers; most felt that this would seriously impact their scores. Consequently, during one of the field studies, we conducted a relevant study, which indicated that the ability to change answers made little difference to scores. However, it made a great difference to the comfort level of the examinees just knowing that they could change answers, and in the end, NBME relented and allowed examinees to change answers.
Although we have not specifically studied other potentially “hot-button” issues, I suspect the same is likely true for a wide variety of other options, such as underlining questions, striking out words, scribbling on the book, etc. These are issues of importance mainly to examinee comfort and most likely would not significantly affect overall performance.
At present, we are still collecting information about the testing environment and are using the following methods. First, a series of surveys are sent to candidates who have completed the exam. As far as quality assurance is concerned, all students are asked to complete at least one survey concerning their test experience. Some have also been queried by telephone as to the testing process, others have been queried by e-mail and some have received paper and pencil surveys. Questions ask about what was good and what was bad.
A second mechanism we are using to assess environmental conditions during the USMLE is to employ what the corporate world refers to as “Secret Shoppers”. Essentially, these are individuals sent as exam candidates, but whose function is to monitor and report back on exam conditions. This is not a new technique, and Sylvan already does this in test centers to ensure that the quality of the testing experience is adequate. However, the NBME wanted its own Secret Shoppers in addition to those employed by the Sylvan Centers. In a few cases, we actually tested the security of the exam by attempting deliberate breaches of security. Some of our staff turn out to be quite proficient at being dishonest! One example of what we have attempted is to have two individuals exchange places. Needless to say, this really stresses the Sylvan system, but we wish to observe how the proctor will respond. Through these efforts we hope to obtain a broader picture of the actual examination environment both in the United States and at the 300 Sylvan sites worldwide where the USMLE is administered.
The next issue that I would like to address concerns practicing for computer-based testing. In the initial phases of development of the computerized USMLE, many faculty members expressed concerns about the computer literacy of their students. Students had somewhat less concern since many used computers on a daily basis. Nonetheless, faculty members were concerned about the level of preparedness of their students. To address this concern, NBME developed means to permit students to take practice exams. Initially, 150 sample items were distributed on a compact disk (CD) that utilized the same driver as in the Sylvan Centers. This parallels the system so closely that individuals can take a timed exam with exactly the same pacing as the real USMLE. At present we are collecting data on how these CDs are being used, whether candidates find them helpful, etc. My guess is that most exam candidates will look through them. At a minimum, they can at least acquaint themselves with the interface experience. The NBME is interested in hearing from you if you have any comments, suggestions, or information as to how these are being used by the exam candidates. In addition to the CD, these materials are available on the USMLE website (http://www.usmle.org/), and practice materials are also available at all Sylvan Centers. Occasionally a candidate may request to take the practice exam at the actual site where they are scheduled to take the “real” USMLE. This can be arranged, although there is a nominal fee of $42.00, which is charged for “seat time” since during this time that Sylvan site cannot be used for other purposes.
Examination Score Reporting
Reporting of examination scores is always an important issue! The first necessity was to collect a sufficient number of students so that the items could be recalibrated. This is because it is possible that the difficulty of some questions might change because of the transition from paper and pencil to CBT. Recalibration requires a relatively large group. It was decided to hold back the first 10,000 scores, do the recalibration on them, and then report all 10,000 scores together during August 1999. The net result would be that everyone should have received their score at approximately the same time as they would have with the former paper-and-pencil exam, i.e. around the middle of August. Following this first group of 10,000, scores will likely be reported on a weekly basis. That means that on one day each week, all scores for examinees since the previous week will be reported. On that day, each school will be able to access student scores by means of a secure website.
I should emphasize that we will no longer be reporting percentiles, but we will continue to report school-specific performance. These reports to the dean’s office will continue to show your school’s scores in each discipline as a function of national means. Our reporting format has been slightly modified because in the past there have been instances where data have been misinterpreted. For example, if anatomy is compared with pharmacology, most schools will exhibit a difference. This reflects a nationwide difference in the mean score between anatomy and pharmacology. Thus, it is invalid to compare anatomy and pharmacology scores directly. What should be compared are the anatomy scores of your students with anatomy scores nationally, and the pharmacology scores of your students with pharmacology scores nationally.
In each school, someone in the dean’s office will have authority to view one or more of the functions offered. The system requires a “smart card” for access. The card is about the size of a credit card; to use it, the holder must enter their name and personal identification number. A special number contained within the card must also be entered at the prompt and is verified against records at the NBME before access is granted to that user. These precautions help ensure that only the correct person gains access to the records of an individual school. This is most important, as NBME does not want schools accessing each other’s data without proper authorization. If a school wishes to share its data it certainly may, but doing so must be under its control. Initially, each website will offer three functions:
- Reporting of scores
- Confirming eligibility of students to sit for the USMLE
- Student status report. This includes dates for registration, mailing of eligibility permit, scheduling and taking the exam, and whether or not that score was pass or fail.
This information provides the dean or other individuals at the medical school with an opportunity to instantly track an individual student, and provide counseling if appropriate.
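The smart-card check described above amounts to a simple two-factor verification: the card number must be on file, and the holder’s name and PIN must match the stored record. The sketch below is illustrative only; the record layout, card numbers, and function names are invented, since the talk does not describe NBME’s actual system.

```python
import hashlib

# Hypothetical records keyed by card number: (user name, PIN hash).
# PINs are stored hashed rather than in the clear.
RECORDS = {
    "CARD-0001": ("dean_office_user", hashlib.sha256(b"1234").hexdigest()),
}


def grant_access(card_number, user_name, pin):
    """Grant access only when the card number is on file AND the
    name and PIN both match the stored record for that card."""
    record = RECORDS.get(card_number)
    if record is None:
        return False  # unknown card: reject immediately
    stored_name, stored_pin_hash = record
    return (user_name == stored_name
            and hashlib.sha256(pin.encode()).hexdigest() == stored_pin_hash)
```

The point of the design is that neither factor alone suffices: a stolen card is useless without the name and PIN, and knowing the PIN is useless without the card’s stored number.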
Overall, CBT for the USMLE seems to be working well for the majority of students who have sat it thus far. The USMLE has therefore now joined a number of other health care professions, particularly nursing, in computerizing its high-stakes examination.
FRED – The NBME’s New Testing Software
I have a demonstration of the driver, the software that runs the test items, puts them up on the screen, and records the answers. This particular driver is quite a bit different from the current Sylvan driver and includes several new features. For example, examinees may underline words or strike them out. Examinees may also add annotation comments if they wish. We will take a look at one of the exam modules. The items are on the right-hand side, and on the left-hand side there is a summary of what is going on. Examinees may either type, or point and click with a mouse. Note that if I select an option, say C, and then change my mind to D, then C un-selects itself. This illustrates that only one answer can be selected at a time. This differs from the paper-and-pencil exam, where students who change their answer may incompletely erase the previous answer, in which case no score can be given.
The right mouse button brings up a menu, and you can see the words “highlight”, “strikeout”, or “annotate”. On the left side, there is a summary of what is going on. It tells me I am on question one. It also tells me I have just added an annotation to that question so that I may navigate back to it and see the annotation. I can also place a “bookmark” on the question. Thus, between the summary area, the annotation, and the bookmark features, we have a strategy that allows instant, random access to any question in the order in which we would like to view and/or answer it. The other feature we have is an exhibit, e.g. for items where we have pictorials. If we wish to bring that exhibit up we may, but it does not clutter up the screen unnecessarily if we do not want it. Finally, I should emphasize that this is just the driver. FRED does not yet include the components that do scoring, scheduling, and all the other tasks required for it to be regarded as a comprehensive software package.
Our plans call for substituting FRED for the current Sylvan software when development is complete and adequate testing has occurred. This will allow the use of more innovative item types, for example including multiple pictorials, sound, and moving pictures in multi-media approaches. We could even explore the feasibility of problem-solving or information-gathering exercises, e.g. with use of the Internet.
Subject Tests (Shelf Exams)
We at NBME are also making a number of changes in relation to the Subject Test program, following a strategic plan that was developed a year or so ago. The two main thrusts I would like to mention are computerization and customization. What is the rationale for computerizing a perfectly good paper and pencil exam? Possible reasons include:
- practicing for USMLE with exams that have the look and feel of CBT USMLE
- diagnostics (including prediction of USMLE score)
- greater flexibility in timing
- richer interface (sound, moving pictures, multi-media, simulations)
- enhanced security
- testing laboratory for new ideas and approaches to assessment
However, I must stress that regardless of when or how fast we develop a CBT alternative for Subject Tests, paper and pencil tests will remain a viable alternative for the foreseeable future for those schools that do not wish to use CBT. I believe it would even be reasonable to use CBT and paper-and-pencil testing for different courses at the same school, although it is probably unwise to mix them for a single course, simply because students may not believe that they yield completely comparable grades.
The second major new thrust for the Subject Test program is customization. Probably many of you are involved in the current rich period of curricular development and evolution that is occurring both in this country and, to some extent, in Europe. To this point we have produced only discipline-based subject exams. However, this appears to contribute to a creeping dissonance between what we offer in the Subject Test and what instructors would like to see in such an exam for the course actually being taught. This leads to the idea that we should begin the process of customizing Subject Exams for individual courses and individual schools. To do so, we likely will need some new exams. These would probably be more interdisciplinary in nature, e.g. Genetics, Cell Biology, or even comprehensive Year 1 or Year 2 exams. Second, we should consider customizing blueprints by building more modular exams along the lines of our current Physiology exam, which comes with or without Neurophysiology. Third, a parallel customization can occur at the level of scoring. With the proviso that the number of relevant items may be small, we may be able to give some idea of performance in each subsection or content category in comparison with the total. Fourth, and this is conjectural and several years down the road, we might simply make our entire Subject Test item pool available to schools. Individuals could build their own examinations and NBME would score them. However, the difficulty with all these approaches is the possibility that multiple different exams attuned to different schools would make national comparisons very difficult. And, of course, this could involve added expense. Our next step is likely to survey schools to determine the level of interest in these various options.
Actually implementing CBT for Subject Tests in medical schools will require much careful thought since there are several significant impediments. First, we require comprehensive testing software, i.e. FRED, both to serve the particular needs of this testing program and to avoid dependence on Sylvan software. We have already spent a significant amount of time and money developing the driver part of FRED to the beta-test version.
Second, we must be certain that performance on a paper and pencil test is the same as on a computer-based test.
We have done that analysis for the USMLE and found absolutely no difference. We are assuming the same would also apply for Subject Tests, but those studies must be done to be certain.
Third, we must have medical school centers of adequate size. When we computerized the USMLE, many schools expressed interest in opening centers at their schools, and as of July we have eight fully operational centers. These USMLE Centers have a seat capacity which is only a fraction (10% or less) of the total class size. Assume that we have 100 students and that a testing center has five seats. During the month of June, with 20 working days, 100 students could take the USMLE. On the other hand, for Subject Tests, instructors typically wish to test the group today, move them on, and start the next course or the next clerkship tomorrow. Thus, we would need much larger centers to accommodate Subject Tests. Developing larger dedicated centers with the same level of security as for the USMLE would require large amounts of space and be very expensive. Even with a large-capacity center (e.g. 25% of class size), four sequential sessions would be needed to test a full class in a single topic in a single day. Because of this, we must consider other innovative ways of configuring test centers for Subject Tests, e.g. large temporary centers in shared space with proctoring but without video monitoring.
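The seat-capacity arithmetic above is easy to make concrete. The sketch below simply restates the figures from the talk (a class of 100, a five-seat USMLE center with 20 working days, and a Subject Test center seating 25% of the class); the function names are my own.

```python
import math


def sessions_needed(class_size, seats):
    """Sequential sittings required to test a full class at a center."""
    return math.ceil(class_size / seats)


def days_to_test_class(class_size, seats, sittings_per_day=1):
    """Working days required at a given number of sittings per day."""
    return math.ceil(sessions_needed(class_size, seats) / sittings_per_day)


# USMLE-style scheduling: students spread over a month. A five-seat
# center, one sitting per day, works through a class of 100 in the
# 20 working days of June.
print(days_to_test_class(100, 5))   # 20

# Subject-Test-style scheduling: the whole class on the same day.
# Even a center seating 25% of the class needs four sequential sessions.
print(sessions_needed(100, 25))     # 4
```

The contrast makes the impediment plain: spreading examinees across a month lets a tiny center suffice, whereas same-day testing forces either a much larger center or multiple back-to-back sessions.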
In summary, I believe that computerized testing is here to stay. It is not “just a phase” through which medical education is transitioning, but rather a benefit that technology offers to better train and evaluate our students. This trend will continue to be reinforced with every medical school that experiments with its own computerized course examinations.