College Faculty as Teachers
by Martin J. Finkelstein
in the NEA 1995 Almanac of Higher Education, National Education
Association (pages 35-42 are below; full article: 33-47)
[excerpts]...................
EVALUATING TEACHING
How well do faculty teach? Evaluating the classroom effectiveness of
college faculty developed from modest beginnings into a cottage
industry in one generation. Nearly a dozen organizations distributed
and scored instruments that record student perceptions of courses and
instructors, including the Purdue cafeteria system, the Student
Instructional Report (SIR) form of Educational Testing Service, and the
Kansas State IDEA form. These evaluations assumed that students—the
consumers of teaching—were in the best position to assess its quality.
Student evaluation of courses, instruction, and instructors is
pervasive. Four out of five campuses use some student rating instrument
to aid in personnel decisions, improvement of teaching, or student
course selection.31
Recently, some colleges complemented student evaluations with alumni
ratings, faculty self-assessments, and peer assessments, including
classroom visitation and review of curricular and teaching
materials.32 This trend culminated in the use of teaching
"portfolios," or documentation of teaching materials, activities, and
eventually student products.33 Such systematic
documentation could provide the basis for meaningful peer review of
teaching just as publications provided the basis for peer review of
faculty scholarship.
But student evaluations still constitute the dominant mode of
assessment. As their use spread, pressing questions were raised about
how solid a foundation the ratings provided. Were the ratings reliable
and valid indicators of teaching effectiveness? Were they subject to
situational distortions? How did they compare with peer ratings or
faculty self-assessments? Such questions spawned a 20-year explosion of
studies on student ratings of instruction, each usually reporting data
from a single campus or from a single instrument. So voluminous were
the findings, that a "meta-literature" reviewed and synthesized
diverse, and frequently contradictory, findings.34
Analysis of this meta-literature was facilitated by the emergence of
sophisticated quantitative techniques that permitted the reanalysis and
integration of findings from many studies. Treating each individual
study as a case or subject, these "meta-analyses" collected data on
sample size and composition, variables examined, instruments employed,
and the findings expressed as descriptive statistics—a zero-order
correlation coefficient, for example. These efforts allow us to
systematically relate patterns in the findings to characteristics of
the studies themselves, thereby helping to locate the source of
variation in the findings.
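The pooling step described above can be sketched in a few lines. This is a minimal illustration of one common way to combine correlation coefficients (Fisher's z transformation with sample-size weights), not necessarily the procedure used in the meta-analyses Finkelstein cites. The function name and the example studies are hypothetical and invented for illustration.

```python
import math


def pooled_correlation(studies):
    """Combine per-study correlations into one estimate.

    `studies` is a list of (r, n) pairs: a zero-order correlation and
    the sample size it was computed from. Each r is mapped to Fisher's z,
    averaged with weights n - 3 (the inverse of z's sampling variance),
    and mapped back to the correlation scale.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for r, n in studies:
        if n <= 3:
            raise ValueError("each study needs a sample size above 3")
        weight = n - 3
        weighted_sum += weight * math.atanh(r)
        total_weight += weight
    z_bar = weighted_sum / total_weight
    std_error = 1.0 / math.sqrt(total_weight)
    interval = (
        math.tanh(z_bar - 1.96 * std_error),
        math.tanh(z_bar + 1.96 * std_error),
    )
    return math.tanh(z_bar), interval


# Hypothetical studies, invented for illustration only.
example_studies = [(0.30, 40), (0.45, 120), (0.38, 75)]
pooled, (low, high) = pooled_correlation(example_studies)
print(f"pooled r = {pooled:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Because larger studies carry more weight, a single small study with an unusual result moves the pooled estimate very little.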
What does this comprehensive and integrated research base tell us about
student rating instruments and their use? First, the ratings obtained
by those "state of the art" nationally developed and disseminated
instruments (Purdue; SIR/ETS; IDEA/Kansas State) were highly reliable;
their results were consistent across administrations to the same
group.35 The instruments were also highly valid in at least
three respects:
● They accurately reflected student opinions assessed independently,
i.e., they measured what they purported to measure.
● They were positively associated (r = 0.5) with student learning and
academic achievement in the course. They measured something
meaningfully related to good teaching.36
● They were highly correlated with colleague ratings, with zero-order
correlations of about 0.9.
What aspects of good teaching do these ratings instruments measure
reliably and validly? Factor structures differed somewhat across
diverse instruments and diverse studies, but Kenneth Feldman identified
three basic dimensions that have proved subsequently to be virtually
universal.37
● Presentation skill, including items related to the ability to
"stimulate student interest," to "clarity," and to "knowledge of subject
matter."
● Rapport, including "sensitivity to individual student learning,"
"sensitivity to class progress," and "availability" or "openness."
● Course organization and classroom management, including syllabus,
assignments, grading practices, etc.38
These were not the only elements of good teaching. They did not, for
example, tap the "currency" of the instructor's course materials or the
instructor's standing in the field. They did, nonetheless, consistently
and reliably tap several acknowledged aspects of good teaching and
could readily be supplemented by other sources.
What about bias? Didn't students give the highest ratings to instructors
who gave them higher grades, for example? Faculty members widely perceived
bias in student ratings, but the literature suggested that there was much
less bias than most faculty thought. Indeed, when bias existed, the
effects were relatively small and consistent; that is, the
patterns were well recognized and could be accounted for.39
Most notable among these were the systematic differences in the ratings
of courses in different disciplines. Table 5 arrays the disciplines
from most to least highly rated. Courses in quantitative fields—such as
mathematics, engineering, and the hard sciences—tended to receive lower
ratings, while certain humanities fields— art and music, for
example—were rated highest. It is not clear to what extent these
patterns reflected differences in the nature of the fields (the extent,
for example, that learning was sequentially organized), in the nature
of students majoring in different fields (different kinds of students
rated differently), or in the nature of faculty teaching in these
fields (scientists may have been poorer teachers).40
Beyond discipline, course characteristics may have modestly affected
ratings though not always in the predicted direction. Smaller classes
tended to be rated more highly than larger classes; but difficult
courses tended to be rated slightly more highly than easier ones.41
Characteristics of the instructor, including gender and age, or of the
student, including gender, age, grade level, GPA, anticipated grade,
bore no statistically significant relationship to ratings.42
If the instruments are reliable and valid indicators of three important
aspects of good teaching and are not subject to unreasonable bias, what
about their use in personnel decisions and in instructional
improvement? Sophisticated instruments, to be blunt, may not be matched
by sophisticated use. About half of the faculty in a sample who used
ratings in personnel decisions could not identify likely sources of
bias in the results. Nor could these faculty members recognize
standards for proper samples (sufficient numbers of students per course,
sufficient numbers of courses per instructor) or interpret common
descriptive statistics.43 Instructional improvement, several studies
showed, was only likely to occur when skilled consultants helped
faculty interpret the ratings and develop better teaching strategies.44
Together, these studies suggest conditions for the appropriate use of
ratings instruments. For personnel decisions, global or summary
ratings—rather than ratings on individual components—provided the best
basis for comparing faculty.45 Global ratings were relevant to all
types of courses in all fields; specific dimensions such as rapport,
for example, might be less relevant to large classes or to the natural
sciences. Global ratings were more clearly related to student academic
achievement, 0.5 zero-order correlation versus 0.3 for individual
dimensions, and were less subject to abuse by the untrained user trying
to weigh various dimensions.46
Sample size was important—meaningful comparisons require sufficient
numbers of courses and of student raters. Systematic (average)
disciplinary differences must also be factored into comparisons. So
must other indicators of good teaching, syllabi or final examination
results, for instance, that tapped dimensions that ratings did not
measure, including currency in the field and student achievement.
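One way to act on these conditions is sketched below: drop courses with too few raters, refuse to summarize an instructor with too few usable courses, and express each global rating relative to the average for its discipline. This is an illustrative sketch only. The thresholds, field names, and discipline norms are hypothetical; the article gives no numeric standards.

```python
# Hypothetical thresholds, chosen for illustration; the article names
# the need for sufficient samples but does not specify numbers.
MIN_RATERS_PER_COURSE = 15
MIN_COURSES_PER_INSTRUCTOR = 5


def standardized_rating(courses, norms,
                        min_courses=MIN_COURSES_PER_INSTRUCTOR,
                        min_raters=MIN_RATERS_PER_COURSE):
    """Average an instructor's global ratings as within-discipline z-scores.

    `courses` is a list of dicts with keys "discipline", "global_rating",
    and "n_raters". `norms` maps a discipline to (mean, standard deviation)
    of global ratings in that discipline. Returns None when the sample is
    too small to support a comparison.
    """
    usable = [c for c in courses if c["n_raters"] >= min_raters]
    if len(usable) < min_courses:
        return None
    z_scores = []
    for course in usable:
        field_mean, field_sd = norms[course["discipline"]]
        z_scores.append((course["global_rating"] - field_mean) / field_sd)
    return sum(z_scores) / len(z_scores)


# Hypothetical norms and courses, invented for illustration only.
example_norms = {"mathematics": (3.6, 0.5), "music": (4.2, 0.4)}
example_courses = [
    {"discipline": "mathematics", "global_rating": 3.9, "n_raters": 20}
    for _ in range(5)
]
print(standardized_rating(example_courses, example_norms))
```

Under these made-up norms, a 3.9 in mathematics sits above its field average while a 3.9 in music would sit below it, which is the kind of disciplinary difference the raw ratings hide.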
Instructional improvement, in contrast, required a focus on many
descriptive and behavioral items. There was no need for concern about
sampling a single set of ratings. Ratings users, the literature
suggests, may need task-specific training, detailed instructions, and a
skilled consultant to guarantee appropriate use of ratings for
personnel decisions and instructional improvement.47 Student
evaluations provided a reliable and valid vehicle for assessing key
aspects of good teaching as long as their use was informed by a sense
of their strengths and limitations.
...............
CONCLUSIONS
Seven propositions summarize our knowledge about college faculty as
teachers:
● College and university faculty saw themselves primarily as teachers
and did, in fact, spend much of their time teaching. That has not
changed appreciably in the last generation. Moreover, faculty did not
want it any differently, except insofar as they needed to publish to
obtain promotion and tenure.
● Lecture was still the predominant mode of instruction, approximately
80 percent of class time, but there were clear differences by
institutional type and discipline.
● Faculty, save for secondary school teachers who go to community
colleges, were ill-prepared, at best, for their initial teaching
experience. The initial years of their first full-time academic
appointment were therefore characterized by high stress generated by
the teaching role, and by limited support. Faculty sustained the
teaching orientation developed during this period throughout their
careers.
● Most faculty increased their interest in teaching during their
careers. But many instructors, lacking the support and opportunity to
sustain their teaching commitments, were subject to "burnout."
● Teaching was typically assessed via student evaluations, but peer and
self-assessment have recently received increased attention. Student
ratings were highly reliable and valid; but their use in faculty
personnel processes could be strengthened.
● Three dimensions of good teaching reflected faculty classroom
practices: presentation skills (content mastery, clarity, and
enthusiasm, for example), rapport (sensitivity and openness to students
and their progress), and course organization and classroom management.
Involvement in research seemed to contribute to content mastery, less
so to rapport and course organization. But research commitments did not
detract from teaching effectiveness, and devoting more time to teaching
might not translate into greater effectiveness.
● Many faculty members desired to improve their teaching, but lacked
the necessary knowledge and resources. Successful faculty development
required institutional support at key points—just starting out, for
example, and mid-career. Above all, teaching must be deisolated and
discussed with colleagues, that is, seen as "community property."
QUO VADIS?
A word of caution. Our knowledge about college faculty as instructors
is primarily based on studies of full-time, four-year, liberal arts
faculty, a shrinking segment of the profession. The American
professorate is becoming an occupational group that is increasingly
part-time (40 percent by headcount and rising), that teaches in the
professional and applied fields, and that works in two-year
institutions. The object of inquiry is changing before our eyes almost
before we can get hold of it.
Forthcoming publication of the results of the National Survey of
Postsecondary Faculty 1993, conducted by the National Center for
Education Statistics, will provide us with updated information about
faculty career trajectories. The survey's large sample of part-time and
community college faculty will enable us to trace the changing contours
of the academic profession, and to use these changes to strengthen
faculty and instructional development activities in the 21st century.
Navigating Student Ratings of Instruction, by Sylvia d'Apollonia and
Philip C. Abrami, American Psychologist, 0003-066X, November 1, 1997,
Vol. 52, Issue 11
excerpts..........
Student ratings of instruction, first introduced into North American
universities in the mid-1920s (Doyle, 1983), have been the subject of a
huge body of literature including primary studies, reviews, and books.
This literature has dealt not only with the psychometric properties of
student ratings (Costin, Greenough, & Menges, 1971; Doyle, 1983;
Feldman, 1978; Marsh, 1984) and the factor structure of ratings (Kulik
& McKeachie, 1975; Marsh, 1991a, 1991b; McKeachie, 1979) but also
with practical guides to faculty evaluation (Arreola, 1995; Braskamp
& Ory, 1994; Centra, 1993). Many researchers have concluded that
the reliability (Centra, 1993) and the validity (Abrami, d'Apollonia,
& Cohen, 1990; Cohen, 1981; Feldman, 1989, 1990; Marsh, 1987) of
student ratings are generally good and that student ratings are the
best, and often the only, method of providing objective evidence for
summative evaluations of instruction (Scriven, 1988). However, a number
of controversies and unanswered questions persist. Our objective in
this article is to briefly and selectively review some of the
literature on student ratings of instruction that addresses the
following three questions: What is the structure of student ratings of
instructional effectiveness? Is instructional effectiveness
substantially correlated with other indicators of instructor-mediated
learning in students? To what extent do student ratings confound
irrelevant variables with instructional effectiveness?
......
The average reliabilities of the student rating and achievement
instruments [usually exam results] in the multisection validity
studies were .74 and .69, respectively.
Therefore, when the correlation coefficient between student ratings of
general instructional skill and student learning was corrected for
attenuation (Downie & Heath, 1974), the correlation became .47,
with a 95% confidence interval extending from .43 to .51. Thus, there
was a moderate to large association between student ratings and
student learning, indicating that
student ratings of General Instructional Skill are valid measures of
instructor-mediated learning in students. The variability in the
validity coefficients reported by the primary investigators can be
explained, in large part, by differences in the sampling variance of
the individual studies.
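The correction for attenuation mentioned above is usually computed by dividing the observed correlation by the square root of the product of the two reliabilities. The sketch below assumes that standard formula. The excerpt does not state the uncorrected coefficient, so the value the code back-calculates (about .34) is an inference from the reported figures, not a number taken from the article. Function names are hypothetical.

```python
import math


def correct_for_attenuation(r_observed, reliability_x, reliability_y):
    """Estimate the correlation between true scores.

    Divides the observed correlation by the square root of the product
    of the two measures' reliabilities (the standard correction).
    """
    for reliability in (reliability_x, reliability_y):
        if not 0.0 < reliability <= 1.0:
            raise ValueError("reliabilities must lie in (0, 1]")
    return r_observed / math.sqrt(reliability_x * reliability_y)


def implied_observed_r(r_corrected, reliability_x, reliability_y):
    """Invert the correction to recover the uncorrected correlation."""
    return r_corrected * math.sqrt(reliability_x * reliability_y)


# Reliabilities and corrected correlation as reported in the excerpt.
RATING_RELIABILITY = 0.74
ACHIEVEMENT_RELIABILITY = 0.69
CORRECTED_R = 0.47

# Back-calculated, not reported in the excerpt.
implied = implied_observed_r(
    CORRECTED_R, RATING_RELIABILITY, ACHIEVEMENT_RELIABILITY
)
print(f"implied uncorrected r = {implied:.2f}")
```

The less reliable the two instruments, the larger the gap between the observed and corrected values, which is why the authors note that validity in practice depends on the reliability of the measures actually used.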
........
In conclusion, the published multisection validity literature suggests
that under appropriate conditions (all instructors are faculty members,
evaluation is carried out prior to students’ knowing their final grade,
sections are equivalent in terms of student prior ability or
equivalence is experimentally controlled) and the validity coefficient
is corrected for attenuation, more than 45% of the variation in student
learning among sections can be explained by student perceptions of
instructor effectiveness. This 45% figure is, of course, an estimate of
validity under the above circumstances. In practice, with the usual
reliability of measures and the ratings context, only partially
matching the set of conditions, the validity coefficient will be
different. In any case, the data set is homogeneous, indicating that
across different students, courses, and settings, student ratings are
consistently valid.
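If "variation explained" is read in the usual way, as the square of a correlation coefficient, the percentages and coefficients in these excerpts can be converted back and forth. That reading is an assumption; the excerpts do not define the term, and the elided passages may report a different coefficient for the "appropriate conditions" case. On that reading, a correlation of .47 corresponds to about 22% of variation, and a 45% share corresponds to a correlation of about .67. The helper names below are hypothetical.

```python
import math


def variance_explained(r):
    """Share of variation accounted for by a correlation r (r squared)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie in [-1, 1]")
    return r * r


def correlation_for_share(share):
    """Size of the correlation that accounts for a given share of variation."""
    if not 0.0 <= share <= 1.0:
        raise ValueError("a share of variation must lie in [0, 1]")
    return math.sqrt(share)


print(f"r = .47 accounts for {variance_explained(0.47):.0%} of variation")
print(f"a 45% share corresponds to r = {correlation_for_share(0.45):.2f}")
```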
.........
Laboratory studies that have used the "Dr. Fox" paradigm have shown
that instructors’ expressivity (Abrami, Leventhal, & Perry, 1982;
Abrami, Perry, & Leventhal, 1982) and grading practices (Abrami,
Perry, & Leventhal, 1982) can unduly influence student ratings of
instruction. However, these studies also have indicated that these
variables do not seriously bias student ratings. Expressivity has a
practically meaningful influence on student ratings of instructors,
with high-expressive instructors scoring about 1.20 standard
deviations above low-expressive instructors. However, instructors’
expressivity also influences student learning. Researchers who have
investigated instructors’ expressivity have concluded that it is not a
biasing variable but rather that it exerts its influence by affecting
student learning (Murray, Rushton, & Paunonen, 1990). Liberal
grading practices increased student ratings, at most, by slightly less
than 0.5 on a 5-point scale. Moreover, in some cases they decreased
ratings. Thus, grading practices are not a practical threat to the
validity of student ratings. In both cases, controlling for the
putative biasing variable "overcorrects" the influence of the putative
biasing variable, removing some of the genuine influence of effective
instruction from the rating scores. In effect, poor instructors may be
doubly rewarded both by students and by evaluators. Instead, we
recommend that student ratings not be overinterpreted. In general,
experts recommend that comprehensive systems of faculty evaluation be
developed, of which student ratings of instruction are only one, albeit
important, component (Arreola, 1995; Braskamp & Ory, 1994; Centra,
1993). Within such a system, student ratings should be used to make
only crude judgments of instructional effectiveness (exceptional,
adequate, and unacceptable).
According to a 1992 paper, five questions must be included on
all instructor evaluation forms at Texas A&M at the request of the
Student Senate. Instructors choose the other items from a long list.
Table 1. Core items rated by students.
| Label | Item |
| repeat | I would take another course from this professor. |
| fair grade | The exams were presented and graded fairly. |
| work amt | The amount of work and/or reading was reasonable for the credit hours received in the course. |
| effective | I believe this instructor was an effective teacher. |
| help | Help was readily available for questions and/or homework outside of class. |
Validity Concerns and Usefulness of Student Ratings of Instruction,
by Anthony G. Greenwald,
American Psychologist, 0003-066X, November 1, 1997, Vol. 52, Issue 11
[excerpts]....................
In summary of the relatively recent literature on student ratings, and
as the following quotes indicate, prominent reviews published since
about 1980 give a clear impression that major questions of the 1970s
about ratings validity were effectively answered and largely put to
rest by subsequent research.
In general, . . . most of the factors [that] might be expected to
invalidate ratings have relatively small effects. . . . Some studies
have found a tendency for teachers giving higher grades to get higher
ratings. However, one might argue that in courses in which students
learn more the grades should be higher and the ratings should be higher
so that a correlation between average grades and ratings is not
necessarily a sign of invalidity. . . . My own conclusion is that one
need not worry much about grading standards within the range of normal
variability. (McKeachie, 1979, pp. 390-391)

Probably, students' evaluations of teaching effectiveness are the most
thoroughly studied of all forms of personnel evaluation, and one of the
best in terms of being supported by empirical research. . . . Although
it is possible that a grading leniency effect may produce some bias in
student ratings, support for this suggestion is weak and the size of
such an effect is likely to be insubstantial in the actual use of
student ratings. (Marsh, 1984, pp. 749, 741)

[Recent] evidence has suggested . . . that rather than signaling
possible contamination and invalidity of student evaluations, the
observed relation between grades and student ratings might reflect
expected, educationally appropriate relations.
(Howard et al., 1985, p. 187)

In general, student ratings tend to be statistically reliable, valid,
and relatively free from bias or the need for control; probably more so
than any other data used for evaluation. (Cashin, 1995, p. 6)

These quotes not only acknowledge that grades and ratings are
correlated but also express the judgment that this correlation can and
should be interpreted without concluding that grades create a
bothersome contamination of ratings.