References on instructor evaluation forms used in higher education
September 2003
assembled by Bill Perttula; see also http://userwww.sfsu.edu/%7Eperttula/ratings/


College Faculty as Teachers
by Martin J. Finkelstein
in the NEA 1995 Almanac of Higher Education, National Education Association (pages 35-42 are below; full article: 33-47)
[excerpts]...................

EVALUATING TEACHING

How well do faculty teach? Evaluating the classroom effectiveness of college faculty developed from modest beginnings into a cottage industry in one generation. Nearly a dozen organizations distributed and scored instruments that record student perceptions of courses and instructors, including the Purdue cafeteria system, the Student Instructional Report (SIR) form of Educational Testing Service, and the Kansas State IDEA form. These evaluations assumed that students—the consumers of teaching—were in the best position to assess its quality.

Student evaluation of courses, instruction, and instructors is pervasive. Four out of five campuses use some student rating instrument to aid in personnel decisions, improvement of teaching, or student course selection.31

Recently, some colleges complemented student evaluations with alumni ratings, faculty self-assessments, and peer assessments, including classroom visitation and review of curricular and teaching materials.32  This trend culminated in the use of teaching "portfolios," or documentation of teaching materials, activities, and eventually student products.33   Such systematic documentation could provide the basis for meaningful peer review of teaching just as publications provided the basis for peer review of faculty scholarship.

But student evaluations still constitute the dominant mode of assessment. As their use spread, pressing questions were raised about how solid a foundation the ratings provided. Were the ratings reliable and valid indicators of teaching effectiveness? Were they subject to situational distortions? How did they compare with peer ratings or faculty self-assessments? Such questions spawned a 20-year explosion of studies on student ratings of instruction, each usually reporting data from a single campus or from a single instrument. So voluminous were the findings, that a "meta-literature" reviewed and synthesized diverse, and frequently contradictory, findings.34

Analysis of this meta-literature was facilitated by the emergence of sophisticated quantitative techniques that permitted the reanalysis and integration of findings from many studies. Treating each individual study as a case or subject, these "meta-analyses" collected data on sample size and composition, variables examined, instruments employed, and the findings expressed as descriptive statistics—a zero-order correlation coefficient, for example. These efforts allow us to systematically relate patterns in the findings to characteristics of the studies themselves, thereby helping to locate the source of variation in the findings.

What does this comprehensive and integrated research base tell us about student rating instruments and their use? First, the ratings obtained by those "state of the art" nationally developed and disseminated instruments (Purdue; SIR/ETS; IDEA/Kansas State) were highly reliable; their results were consistent across administrations to the same group.35 The instruments were also highly valid in at least three respects:

● They accurately reflected student opinions assessed independently, i.e., they measured what they purported to measure.

● They were positively associated (r = 0.5) with student learning and academic achievement in the course. They measured something meaningfully related to good teaching.36

● They were highly correlated with colleague ratings, with zero-order correlations of about 0.9.

What aspects of good teaching do these ratings instruments measure reliably and validly? Factor structures differed somewhat across diverse instruments and diverse studies, but Kenneth Feldman identified three basic dimensions that have proved subsequently to be virtually universal.37

● Presentation skill, including items related to the ability to "stimulate student interest," to "clarity," to "knowledge of subject matter."

● Rapport, including "sensitivity to individual student learning," "sensitivity to class progress," and "availability" or "openness."

● Course organization and classroom management, including syllabus, assignments, grading practices, etc.38

These were not the only elements of good teaching. They did not, for example, tap the "currency" of the instructor's course materials or the instructor's standing in the field. They did, nonetheless, consistently and reliably tap several acknowledged aspects of good teaching and could readily be supplemented by other sources.

What about bias? Didn't students rate those instructors highest when they get higher grades, for example? Faculty members widely perceived bias in student ratings but the literature suggested that there is much less bias than most faculty think. Indeed, when bias existed, the effects were relatively small, and they were consistent, that is, the patterns were well-recognized and can be accounted for.39   Most notable among these were the systematic differences in the ratings of courses in different disciplines. Table 5 arrays the disciplines from most to least highly rated. Courses in quantitative fields—such as mathematics, engineering, and the hard sciences—tended to receive lower ratings, while certain humanities fields— art and music, for example—were rated highest. It is not clear to what extent these patterns reflected differences in the nature of the fields (the extent, for example, that learning was sequentially organized), in the nature of students majoring in different fields (different kinds of students rated differently), or in the nature of faculty teaching in these fields (scientists may have been poorer teachers).40

Beyond discipline, course characteristics may have modestly affected ratings, though not always in the predicted direction. Smaller classes tended to be rated more highly than larger classes; but difficult courses tended to be rated slightly more highly than easier ones.41 Characteristics of the instructor, including gender and age, or of the student, including gender, age, grade level, GPA, anticipated grade, bore no statistically significant relationship to ratings.42

If the instruments are reliable and valid indicators of three important aspects of good teaching and are not subject to unreasonable bias, what about their use in personnel decisions and in instructional improvement? Sophisticated instruments, to be blunt, may not be matched by sophisticated use. About half of the faculty in a sample who used ratings in personnel decisions could not identify likely sources of bias in the results. Nor could these faculty members recognize standards for proper samples (sufficient numbers of students per course, sufficient numbers of courses per instructor) or interpret common descriptive statistics.43 Instructional improvement, several studies showed, was only likely to occur when skilled consultants helped faculty interpret the ratings and develop better teaching strategies.44

Together, these studies suggest conditions for the appropriate use of ratings instruments. For personnel decisions, global or summary ratings, rather than ratings on individual components, provided the best basis for comparing faculty.45 Global ratings were relevant to all types of courses in all fields; specific dimensions such as rapport, for example, might be less relevant to large classes or to the natural sciences. Global ratings were more clearly related to student academic achievement (a zero-order correlation of 0.5 versus 0.3 for individual dimensions) and were less subject to abuse by the untrained user trying to weigh various dimensions.46
Sample size was important—meaningful comparisons require sufficient numbers of courses and of student raters. Systematic (average) disciplinary differences must also be factored into comparisons. So must other indicators of good teaching, syllabi or final examination results, for instance, that tapped dimensions that ratings did not measure, including currency in the field and student achievement.

Instructional improvement, in contrast, required a focus on many descriptive and behavioral items. There was no need for concern about sampling a single set of ratings. Ratings users, the literature suggests, may need task-specific training, detailed instructions, and a skilled consultant to guarantee appropriate use of ratings for personnel decisions and instructional improvement.47 Student evaluations provided a reliable and valid vehicle for assessing key aspects of good teaching as long as their use was informed by a sense of their strengths and limitations.

...............
CONCLUSIONS

Seven propositions summarize our knowledge about college faculty as teachers:

● College and university faculty saw themselves primarily as teachers and did, in fact, spend much of their time teaching. That has not changed appreciably in the last generation. Moreover, faculty did not want it any differently, except insofar as they needed to publish to obtain promotion and tenure.

● Lecture was still the predominant mode of instruction, approximately 80 percent of class time, but there were clear differences by institutional type and discipline.

● Faculty, save for secondary school teachers who go to community colleges, were ill-prepared, at best, for their initial teaching experience. The initial years of their first full-time academic appointment were therefore characterized by high stress generated by the teaching role, and by limited support. Faculty sustained the teaching orientation developed during this period throughout their careers.

● Most faculty increased their interest in teaching during their careers. But many instructors, lacking the support and opportunity to sustain their teaching commitments, were subject to "burnout."

● Teaching was typically assessed via student evaluations, but peer and self-assessment have recently received increased attention. Student ratings were highly reliable and valid; but their use in faculty personnel processes could be strengthened.

● Three dimensions of good teaching reflected faculty classroom practices: presentation skills (content mastery, clarity, and enthusiasm, for example), rapport (sensitivity and openness to students and their progress), and course organization and classroom management. Involvement in research seemed to contribute to content mastery, less so to rapport and course organization. But research commitments did not detract from teaching effectiveness, and devoting more time to teaching might not translate into greater effectiveness.

● Many faculty members desired to improve their teaching, but lacked the necessary knowledge and resources. Successful faculty development required institutional support at key points—just starting out, for example, and mid-career. Above all, teaching must be deisolated and discussed with colleagues, that is, seen as "community property."

QUO VADIS?

A word of caution. Our knowledge about college faculty as instructors is primarily based on studies of full-time, four-year, liberal arts faculty, a shrinking segment of the profession. The American professorate is becoming an occupational group that is increasingly part-time (40 percent by headcount and rising), that teaches in the professional and applied fields, and that works in two-year institutions. The object of inquiry is changing before our eyes almost before we can get hold of it.

Forthcoming publication of the results of the National Survey of Postsecondary Faculty 1993, conducted by the National Center for Education Statistics, will provide us with updated information about faculty career trajectories. The survey's large sample of part-time and community college faculty will enable us to trace the changing contours of the academic profession, and to use these changes to strengthen faculty and instructional development activities in the 21st century.

Navigating Student Ratings of Instruction, by Sylvia d'Apollonia and Philip C. Abrami, American Psychologist (ISSN 0003-066X), November 1, 1997, Vol. 52, Issue 11
excerpts..........
Student ratings of instruction, first introduced into North American universities in the mid-1920s (Doyle, 1983), have been the subject of a huge body of literature including primary studies, reviews, and books. This literature has dealt not only with the psychometric properties of student ratings (Costin, Greenough, & Menges, 1971; Doyle, 1983; Feldman, 1978; Marsh, 1984) and the factor structure of ratings (Kulik & McKeachie, 1975; Marsh, 1991a, 1991b; McKeachie, 1979) but also with practical guides to faculty evaluation (Arreola, 1995; Braskamp & Ory, 1994; Centra, 1993). Many researchers have concluded that the reliability (Centra, 1993) and the validity (Abrami, d'Apollonia, & Cohen, 1990; Cohen, 1981; Feldman, 1989, 1990; Marsh, 1987) of student ratings are generally good and that student ratings are the best, and often the only, method of providing objective evidence for summative evaluations of instruction (Scriven, 1988). However, a number of controversies and unanswered questions persist. Our objective in this article is to briefly and selectively review some of the literature on student ratings of instruction that addresses the following three questions: What is the structure of student ratings of instructional effectiveness? Is instructional effectiveness substantially correlated with other indicators of instructor-mediated learning in students? To what extent do student ratings confound irrelevant variables with instructional effectiveness?

......
The average reliabilities of the student rating and achievement instruments [usually exam results] in the multisection validity studies were .74 and .69, respectively. Therefore, when the correlation coefficient between student ratings of general instructional skill and student learning was corrected for attenuation (Downie & Heath, 1974), the correlation became .47, with a 95% confidence interval extending from .43 to .51. Thus, there was a moderate to large association between student ratings and student learning, indicating that student ratings of General Instructional Skill are valid measures of instructor-mediated learning in students. The variability in the validity coefficients reported by the primary investigators can be explained, in large part, by differences in the sampling variance of the individual studies.
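[Compiler's illustration, not part of the article.] The correction for attenuation mentioned above can be sketched in a few lines. This is a minimal sketch that assumes the standard Spearman form of the correction (observed correlation divided by the square root of the product of the two reliabilities); the excerpt does not state which formula the authors used, nor does it give the uncorrected correlation, so the "implied" observed value below is a back-calculation under that assumption. The function and variable names are hypothetical.

```python
import math

def correct_for_attenuation(r_observed, rel_x, rel_y):
    """Spearman's correction: divide the observed correlation by the
    square root of the product of the two instruments' reliabilities."""
    return r_observed / math.sqrt(rel_x * rel_y)

def implied_observed(r_corrected, rel_x, rel_y):
    """Inverse of the correction: the observed correlation that would
    yield r_corrected under the same (assumed) formula."""
    return r_corrected * math.sqrt(rel_x * rel_y)

# Figures quoted in the excerpt above.
REL_RATINGS = 0.74      # average reliability of the student rating instruments
REL_ACHIEVEMENT = 0.69  # average reliability of the achievement instruments
R_CORRECTED = 0.47      # correlation after correction for attenuation

# Back-calculated, not reported in the excerpt: an assumption of this sketch.
r_obs = implied_observed(R_CORRECTED, REL_RATINGS, REL_ACHIEVEMENT)
print(f"implied uncorrected correlation: {r_obs:.2f}")

# Round trip: applying the correction recovers the quoted value.
print(f"corrected again: {correct_for_attenuation(r_obs, REL_RATINGS, REL_ACHIEVEMENT):.2f}")
```

Under these assumptions the implied uncorrected correlation is roughly one third, which shows how much of the gap between observed and corrected validity comes from measurement unreliability in both instruments rather than from the ratings alone.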
........
In conclusion, the published multisection validity literature suggests that under appropriate conditions (all instructors are faculty members, evaluation is carried out prior to students’ knowing their final grade, sections are equivalent in terms of student prior ability or equivalence is experimentally controlled) and the validity coefficient is corrected for attenuation, more than 45% of the variation in student learning among sections can be explained by student perceptions of instructor effectiveness. This 45% figure is, of course, an estimate of validity under the above circumstances. In practice, with the usual reliability of measures and the ratings context, only partially matching the set of conditions, the validity coefficient will be different. In any case, the data set is homogeneous, indicating that across different students, courses, and settings, student ratings are consistently valid.
.........
Laboratory studies that have used the "Dr. Fox" paradigm have shown that instructors’ expressivity (Abrami, Leventhal, & Perry, 1982; Abrami, Perry, & Leventhal, 1982) and grading practices (Abrami, Perry, & Leventhal, 1982) can unduly influence student ratings of instruction. However, these studies also have indicated that these variables do not seriously bias student ratings. Expressivity has a practically meaningful influence on student ratings of instructors, with high-expressive instructors scoring about 1.20 standard deviations above low-expressive instructors. However, instructors’ expressivity also influences student learning. Researchers who have investigated instructors’ expressivity have concluded that it is not a biasing variable but rather that it exerts its influence by affecting student learning (Murray, Rushton, & Paunonen, 1990). Liberal grading practices increased student ratings, at most, by slightly less than 0.5 on a 5-point scale. Moreover, in some cases they decreased ratings. Thus, grading practices are not a practical threat to the validity of student ratings. In both cases, controlling for the putative biasing variable "overcorrects" the influence of the putative biasing variable, removing some of the genuine influence of effective instruction from the rating scores. In effect, poor instructors may be doubly rewarded both by students and by evaluators. Instead, we recommend that student ratings not be overinterpreted. In general, experts recommend that comprehensive systems of faculty evaluation be developed, of which student ratings of instruction are only one, albeit important, component (Arreola, 1995; Braskamp & Ory, 1994; Centra, 1993). Within such a system, student ratings should be used to make only crude judgments of instructional effectiveness (exceptional, adequate, and unacceptable).



According to a 1992 paper, the following five questions must be included on all instructor evaluation forms at Texas A&M at the request of the Student Senate. The instructor chooses other items from a long list.

Table 1. Core items rated by students.

Label       Item
repeat      I would take another course from this professor.
fair grade  The exams were presented and graded fairly.
work amt    The amount of work and/or reading was reasonable for the credit hours received in the course.
effective   I believe this instructor was an effective teacher.
help        Help was readily available for questions and/or homework outside of class.



Validity Concerns and Usefulness of Student Ratings of Instruction, by Anthony G. Greenwald, American Psychologist (ISSN 0003-066X), November 1, 1997, Vol. 52, Issue 11
[excerpts]....................
In summary of the relatively recent literature on student ratings, and as the following quotes indicate, prominent reviews published since about 1980 give a clear impression that major questions of the 1970s about ratings validity were effectively answered and largely put to rest by subsequent research.

In general, . . . most of the factors [that] might be expected to invalidate ratings have relatively small effects. . . . Some studies have found a tendency for teachers giving higher grades to get higher ratings. However, one might argue that in courses in which students learn more the grades should be higher and the ratings should be higher so that a correlation between average grades and ratings is not necessarily a sign of invalidity. . . . My own conclusion is that one need not worry much about grading standards within the range of normal variability. (McKeachie, 1979, pp. 390-391)

Probably, students' evaluations of teaching effectiveness are the most thoroughly studied of all forms of personnel evaluation, and one of the best in terms of being supported by empirical research. . . . Although it is possible that a grading leniency effect may produce some bias in student ratings, support for this suggestion is weak and the size of such an effect is likely to be insubstantial in the actual use of student ratings. (Marsh, 1984, pp. 749, 741)

[Recent] evidence has suggested . . . that rather than signaling possible contamination and invalidity of student evaluations, the observed relation between grades and student ratings might reflect expected, educationally appropriate relations. (Howard et al., 1985, p. 187)

In general, student ratings tend to be statistically reliable, valid, and relatively free from bias or the need for control; probably more so than any other data used for evaluation. (Cashin, 1995, p. 6)

These quotes not only acknowledge that grades and ratings are correlated but also express the judgment that this correlation can and should be interpreted without concluding that grades create a bothersome contamination of ratings.