Test reliability and validity: What SLPs should know
By: Ellen Kester, Ph.D. and Alejandro Brice, Ph.D.
We have all heard the terms “valid” and “reliable” associated with standardized tests. What exactly do those terms mean? How do I know how valid and reliable a test is? Is it my responsibility as a speech-language pathologist to calculate validity and reliability?
What are validity and reliability?
Generally speaking, validity is an estimate of whether a test measures what it purports to measure, and reliability refers to how consistently the test measures what it measures. Before publishing a new measure, test makers typically conduct large-scale studies that give users estimates of validity and reliability. There are a number of ways to estimate validity and reliability, and it is common for test developers to report many different types of estimates. Keep in mind that we cannot say that a test simply is or is not valid, or that it is or is not reliable. Instead, we look at the estimates of reliability and validity to determine whether they are high or low. Estimates can range from 0.0 to 1.0, and estimates of 0.6 and above are considered high. Fornell and Larcker (1981) suggest that constructs should exhibit estimates of 0.5 or higher.
Validity and reliability are not independent of one another. A test must have high estimates of reliability in order to have high estimates of validity. In other words, if the instrument is inconsistent in its measurement, it is likely not measuring what it was designed to measure.
Types of Validity
Face validity is a simple measure in which everyday people judge the instrument “on its face.” For example, does an articulation test look, to someone who is not a speech-language pathologist, like it measures the sounds that children produce?
Content validity relates to whether the instrument takes all of the content into consideration. For example, an articulation test that only included two sounds would not cover the entire domain of articulation. The judgment of content validity is generally made by someone who is an expert in the field and has knowledge of the content domain.
To estimate criterion-related validity, the instrument in question is compared to another instrument that purports to measure the same thing. For example, the PLS-4 and the CELF-4-Preschool are both tests designed to measure language skills in preschoolers. Criterion-related validity would look at the correlation between scores on similar tests.
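As a rough illustration of the comparison described above, the sketch below computes a Pearson correlation between scores on two tests taken by the same children. The scores, the variable names, and the `pearson_r` helper are hypothetical and exist only for this example; published validity studies use far larger samples.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical standard scores for the same eight children on two language tests.
test_a = [85, 92, 78, 101, 110, 96, 88, 104]
test_b = [82, 95, 80, 99, 112, 93, 90, 100]

validity_estimate = pearson_r(test_a, test_b)
print(f"Criterion-related validity estimate: {validity_estimate:.2f}")
```

A coefficient close to 1.0 would suggest that the two instruments rank children in much the same way.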
Predictive validity estimates the instrument’s ability to predict something. For example, we might expect that a high score on the Goldman-Fristoe Test of Articulation-2 would predict a high level of intelligibility in running speech at a future time.
Concurrent validity is an estimate of an instrument’s ability to distinguish between groups that are different. For example, if we use the Goldman-Fristoe Test of Articulation-2 to evaluate a group of children who are difficult to understand and a group of children who are highly intelligible, we would expect the scores of the two groups to be very different.
Convergent validity estimates the degree to which the instrument is similar to other instruments that it should be theoretically similar to.
Discriminant validity estimates the degree to which the instrument differs from instruments that it should not be theoretically similar to.
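One way to put a number on how different two groups' scores are is a standardized mean difference. The sketch below uses Cohen's d, which is our choice for illustration and not a method named in this article. The group scores and the `cohens_d` helper are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Hypothetical articulation standard scores for two small groups of children.
intelligible = [102, 98, 110, 105, 95, 108]
hard_to_understand = [72, 80, 68, 77, 85, 74]

effect_size = cohens_d(intelligible, hard_to_understand)
print(f"Standardized difference between groups: {effect_size:.2f}")
```

A large value means the test separates the two groups clearly. A value near zero would suggest the test does not distinguish them.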
Reliability indicates the extent to which individual differences on test scores are attributable to “true” differences versus chance errors. In other words, how much of the total variance is accounted for by true variance? If a test cannot provide sufficiently accurate, consistent, or reproducible scores, then it will neither correlate highly with other variables nor provide a useful means of inference about underlying constructs.
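The idea of true variance as a share of total variance can be illustrated with a small simulation. This is only a sketch under assumed values: true abilities with a standard deviation of 15 and random measurement error with a standard deviation of 5, so the expected ratio is 15² / (15² + 5²) = 0.90. In practice true scores are never observed directly, which is why reliability has to be estimated by the methods described in the next section.

```python
import random
from statistics import pvariance

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Assumed values for illustration only: true SD of 15, error SD of 5.
true_scores = [rng.gauss(100, 15) for _ in range(5000)]
observed_scores = [t + rng.gauss(0, 5) for t in true_scores]

# Reliability as the share of observed-score variance that is true-score variance.
reliability = pvariance(true_scores) / pvariance(observed_scores)
print(f"Estimated reliability: {reliability:.2f}")
```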
Types of Reliability
Does a subject perform similarly on two administrations of the same instrument? Test-retest reliability is estimated with the correlation between two test scores for the same test for a group of subjects.
Correlation between alternate, equivalent forms of a test provides an estimate that avoids many of the difficulties (e.g., a practice effect) associated with a test-retest approach. However, even with alternate forms, practice effects can confound the reliability estimate.
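Both test-retest and alternate-form estimates come down to correlating two sets of scores from the same group of subjects. The sketch below does that with hypothetical scores and also reports the average gain on the second administration, which is one simple way to spot a possible practice effect. The data and helper function are assumptions made for this example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for the same eight children on two administrations.
first_administration = [85, 92, 78, 101, 110, 96, 88, 104]
second_administration = [89, 94, 83, 103, 112, 101, 90, 108]

retest_r = pearson_r(first_administration, second_administration)
gains = [b - a for a, b in zip(first_administration, second_administration)]
mean_gain = sum(gains) / len(gains)

print(f"Test-retest estimate: {retest_r:.2f}")
print(f"Average gain on second administration: {mean_gain:.2f} points")
```

Note that the correlation can be high even when every child scores a few points higher the second time, so the coefficient alone does not reveal a practice effect.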
Internal Consistency Estimates
There are a number of ways to estimate the reliability of a test from a single test administration. One approach is to correlate responses on odd-numbered items with even-numbered items, a “split-half” reliability estimate.
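A minimal sketch of the odd/even approach follows. The item responses are hypothetical (1 = correct, 0 = incorrect). The final step applies the Spearman-Brown correction, which is not mentioned in this article but is commonly used because each half is only half as long as the full test.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical responses: one row per child, one column per item.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 0, 0, 0, 0],
]

# Items 1, 3, 5, ... sit at indexes 0, 2, 4, ...; items 2, 4, 6, ... at 1, 3, 5, ...
odd_half = [sum(row[0::2]) for row in responses]
even_half = [sum(row[1::2]) for row in responses]

half_r = pearson_r(odd_half, even_half)
# Spearman-Brown correction (an addition to the article's description).
full_test_estimate = 2 * half_r / (1 + half_r)

print(f"Correlation between halves: {half_r:.2f}")
print(f"Spearman-Brown corrected estimate: {full_test_estimate:.2f}")
```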
For most tests, errors due to scorer differences are not a significant factor. This is not true, however, for tests that are more subjective, such as language samples and projective testing. As with any correlation coefficient, a reliability coefficient is dependent on the variability of the sample used. One cannot assume that the coefficient computed on a heterogeneous sample will hold for a homogeneous sample.
Another important factor to consider when selecting a testing instrument is the norm group. Test developers select a group of subjects who are administered the test. The results of their testing are what establish the standard scores, percentiles, etc. Some important things to consider in looking at norm groups are:
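The dependence on sample variability can be seen in a small simulation. The sketch below is an illustration under assumed values (true abilities with a standard deviation of 15, measurement error with a standard deviation of 5), not an analysis of any real test. It compares the correlation between two administrations in a full, heterogeneous sample with the same correlation in a narrow, homogeneous subgroup.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Each simulated child: (true ability, first administration, second administration).
children = []
for _ in range(4000):
    true_ability = rng.gauss(100, 15)
    children.append(
        (true_ability, true_ability + rng.gauss(0, 5), true_ability + rng.gauss(0, 5))
    )

# Homogeneous subgroup: children whose true ability falls in a narrow band.
narrow_group = [c for c in children if 95 <= c[0] <= 105]

r_full = pearson_r([c[1] for c in children], [c[2] for c in children])
r_narrow = pearson_r([c[1] for c in narrow_group], [c[2] for c in narrow_group])

print(f"Heterogeneous sample: r = {r_full:.2f}")
print(f"Homogeneous subgroup: r = {r_narrow:.2f}")
```

The measurement error is identical in both groups. The coefficient drops in the subgroup only because there is less true variability among the children for the test to detect.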
- Representation: This is the extent to which the group is characteristic of a particular population. Those factors generally thought to be most important are: age, grade level, gender, geographic region, ethnicity, and socioeconomic status.
- Size: The number of subjects in a group should be at least 100 per cell for standardizing a test.
- Relevance: For some purposes, national norms may be most relevant. In other cases, norms on a specific subgroup may be most relevant.
McCauley and Swisher (1984) reported 10 psychometric criteria that should be used to review tests.
- The test manual should clearly define the standardization sample so that the user can examine the test’s appropriateness for a particular test taker.
- An adequate sample size should be used in the standardization sample. Subgroups should have a sample of 100 or more.
- The reliability and validity of the test should be promoted by the use of systematic item analysis during the test construction and item selection. To meet the criteria, the manual needs to report evidence that quantitative methods were used both to study and control item difficulty.
- Measures of central tendency and dispersion should be reported in the manual.
- Evidence of concurrent validity (showing that the test can currently discriminate normal from abnormal performance) should be reported in the manual.
- Evidence of construct validity should be supplied in the test manual.
- Evidence of predictive validity (shows that the test can predict later performance on another valid instrument) should be in the manual.
- An estimate of test-retest reliability should be provided.
- Empirical evidence of interexaminer reliability should be provided.
- Test administration should be described sufficiently to enable the test user to duplicate the administration and scoring procedures.
Based on McCauley and Swisher (1984), the Hays Consolidated Independent School District in Central Texas developed a Test Evaluation Form for use in their district.
- Test Name:
- What is the cost of the test?
- Date Published:
- Ages it assesses:
- What areas are assessed by the test/subtests?
- What are the response modes of the test/subtests?
- What are the task demands of the test/subtests?
- Describe the norm group including demographics, ages, how many in the whole norm sample and in sub-groups if applicable. Look for the numbers in the technical section but also look at how many norm groups there are for those ages.
- What are the derived scores it uses? What are the mean and standard deviation for the test and subtests (if applicable) you use when interpreting a student’s standard score?
- What types of validity coefficients are reported for this test?
- What types of reliability coefficients are reported for this test?
- When looking at the ability of this test to discriminate disorders from average language learners, what were the differences of means between the control group and the clinical group?
- What specific abilities are needed by the examiner to administer the test? What materials are needed? Are they provided?
- In your opinion, would you purchase this test to use?
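The form's question about the mean and standard deviation matters because a standard score can only be interpreted relative to those two values. As a rough sketch, the function below converts a standard score to a z-score and an approximate percentile rank. It assumes normally distributed scores, and the default mean of 100 and standard deviation of 15 are assumptions for illustration; always use the values reported in the manual for the test in question. The function name is hypothetical.

```python
from statistics import NormalDist

def describe_standard_score(score, test_mean=100.0, test_sd=15.0):
    """Return (z-score, approximate percentile rank) for a standard score.

    Assumes scores are normally distributed; test_mean and test_sd should
    come from the test manual.
    """
    z = (score - test_mean) / test_sd
    percentile = NormalDist().cdf(z) * 100
    return z, percentile

z, percentile = describe_standard_score(85)
print(f"z = {z:.2f}, percentile rank = {percentile:.1f}")

# A subtest reported on a different scale (assumed mean 10, SD 3) works the same way.
subtest_z, subtest_percentile = describe_standard_score(7, test_mean=10, test_sd=3)
```

In both calls the score sits one standard deviation below the mean, even though the raw numbers (85 and 7) look very different.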
Fornell, C. & Larcker, D.F. (1981). Evaluating structural equation models with unobservable variables and measurement error. Journal of Marketing Research, 18,
McCauley, R. J. & Swisher, L. (1984). Psychometric review of language and articulation tests for preschool children. Journal of Speech and Hearing Disorders, 49, 34-42.
This Month’s Featured Authors: Alejandro Brice, Ph.D., CCC-SLP and Ellen Kester, Ph.D.
Dr. Alejandro E. Brice is an Associate Professor at the University of South Florida St. Petersburg in Secondary/ESOL Education. His research has focused on issues of transference or interference between two languages in the areas of phonetics, phonology, semantics, and pragmatics related to speech-language pathology. In addition, his clinical expertise relates to the appropriate assessment and treatment of Spanish-English speaking students and clients. Please visit his website at http://scholar.google.com/citations?user=LkQG42oAAAAJ&hl=en or reach him by email at [email protected]
Dr. Ellen Kester is a Founder and President of Bilinguistics, Inc. (http://www.bilinguistics.com). She earned her Ph.D. in Communication Sciences and Disorders from The University of Texas at Austin. She earned her Master’s degree in Speech-Language Pathology and her Bachelor’s degree in Spanish at The University of Texas at Austin. She has provided bilingual Spanish/English speech-language services in schools, hospitals, and early intervention settings. Her research focus is on the acquisition of semantic language skills in bilingual children, with emphasis on assessment practices for the bilingual population. She has conducted workshops and training seminars, and has presented at conferences both nationally and internationally. Dr. Kester teaches courses in language development, assessment and intervention of language disorders, early childhood intervention, and measurement at The University of Texas at Austin. She can be reached at [email protected]