The National Assessment of Educational Progress
(NAEP) "The Nation’s Report Card"
The NAEP is a federally administered academic testing program for school systems throughout the nation. NAEP documents often refer to the assessments as "The Nation’s Report Card."
The NAEP has been of considerable interest in many states, including Kentucky, as it generally offers the only state-to-state comparisons available for fourth and eighth grade academic performance. However, there are often considerable problems involved with making these comparisons, as discussed below.
The NAEP is operated by the US Department of Education at the direction of the Congress. It is administered by the National Center for Education Statistics. Since 1988, NAEP policy has been determined by the congressionally created non-partisan National Assessment Governing Board.
The NAEP began in 1969 as a strictly nation-wide test, prohibited by law from producing scores for either individual states or local school jurisdictions. The testing samples were drawn from across the entire nation in such a way that the results would actually provide invalid scores even if the students from each state could be separately identified. In succeeding years, more testing has been added to cover both state level results and, most recently, results for some of the nation’s largest urban school districts.
The NAEP is a sampled assessment. When only a national sample is taken, around 10,000 to 20,000 students from across the nation are tested. When state samples are conducted, around 3,000 students or fewer are tested in each state. The students from each state generally come from around 100 schools (see Frequently Asked NAEP Questions Number 2).
For more information, readers can access An Introduction to NAEP.
- However, note that some critics take issue with some of the presentations in this NAEP publication. For example, the depiction of how test booklets are distributed in individual classrooms is not the norm, as science is not given nearly as frequently (generally about once every five years) as math and reading (two-year cycle at present). Also, as will be shown below, the statement found on page 5 that “NAEP makes state-to-state comparisons reliable” is not universally accepted.
Over the years the NAEP has periodically assessed various academic areas such as mathematics, reading, science, writing, the arts, civics, economics, geography, and U.S. history. So far, State NAEP has assessed only math, reading, writing, and science. Subjects tested by the Trial Urban District Assessment are even more restricted, at present.
A listing of past administrations of the NAEP can be found here.
And, the scheduled NAEP testing for 2008 and beyond can be found here.
Types of NAEP Assessments
There are two primary series of NAEP assessments. The longest running report series available today is called the Long Term Trend NAEP. Until recently, the Long Term Trend NAEP kept the same testing format and protocols that were used in the early days of NAEP.
The other primary series of NAEP tests is called "Main NAEP". These tests have seen more variation in their question formats and other features as the NAEP’s managers reacted to changes in education theories over the past 40 years.
In 1990, in response to growing pressure for state level information, the Congress changed the law and the first "State NAEP" was administered as part of the Main NAEP. This first State NAEP assessment was only conducted in eighth grade math. Two years later, State NAEP assessments were conducted in both fourth and eighth grade math and fourth grade reading. More State NAEP assessments were later added in writing and science in both fourth and eighth grades. Kentucky has continuously participated in State NAEP since its inception.
Most recently, NAEP added reporting for some of the nation’s largest metropolitan school systems. Known as the "Trial Urban District Assessments", or "TUDA", these are the latest additions to the growing series of NAEP products. Only one Kentucky district, Jefferson County Public Schools (Louisville), could qualify for TUDA, but as of 2007 it has not been a participant.
How to Find NAEP Results
NAEP results are available in a large number of formats varying from full, published documents to several Web tools that allow users to custom design their own report tables.
- NAEP Reports by Academic Subject The most commonly referenced NAEP documents, and the easiest to access and understand, can be accessed by starting at the NAEP Home page.
First click on the academic subject of interest, for example, "mathematics." That brings up a new page that lists a number of different reports on this area. For example, under a section titled "Mathematics Results," you can click on a link that will download a copy of the "Mathematics Report Card."
Notice that more reports can also be accessed from this single mathematics Web access page. For example, links are available here to the latest Long Term Trend NAEP Mathematics assessment and to the latest Trial Urban District Assessment in mathematics. Individual state reports can also be accessed through this single page portal. It should be noted that the Report Cards also have data on all 50 states’ results if a state-level NAEP was administered in the subject.
- NAEP Data Explorer For those who want to assemble their own data tables, the NAEP Data Explorer offers a powerful tool that can access much more information than can be found in the Report Cards and other published documents. For example, the Data Explorer allows users to custom assemble tables containing multiple years of data which look at performance for selected student subgroups that are not available in the standard Report Cards. Users can also perform tests of statistical significance for different results, say between different states, that often are not presented in the Report Cards.
Due to its complexity, new users should view the tutorial before starting to use the NAEP Data Explorer. The tutorial can be accessed from a link on the NAEP Data Explorer home page.
- Long Term Trend Data Explorer For those who want to access this NAEP series data, a separate Long Term Trend Data Explorer is used. Operation is similar to the regular data explorer, and the tutorial is also recommended as a first step to using this powerful, but somewhat complex, tool.
The NAEP reports scores using two primary score formats, NAEP Scale Scores and NAEP Achievement Level Scores.
- NAEP Scale Scores use either a 0 to 500 (e.g. math and reading) or a 0 to 300 scale (e.g. science and writing).
- NAEP Achievement Level Scores are similar to scores for Kentucky’s public school assessments. They are reported as four levels (see Page 8 in The Nation’s Report Card: Mathematics 2007 (NCES 2007–494)):
- Below Basic – Performance below the Basic level
- Basic -- Denotes partial mastery of prerequisite knowledge and skills that are fundamental for proficient work at a given grade.
- Proficient -- Represents solid academic performance. Students reaching this level have demonstrated competency over challenging subject matter.
- Advanced -- Represents superior performance.
For more about score reporting click here
Issues with NAEP Scale Scores
The scale scores are not published in direct terms of number of points possible versus achieved or as percentiles. Instead, the scores for the 500-point assessments are reported on a converted scale, which was originally established so that, if all students in all school levels took the assessment, the mean score would be 250 with a standard deviation of 50 (see page 7 in The Nation’s Report Card: Mathematics 2000, NCES 2001–517, for an example description of the scoring scale).
This involved scoring method makes the scores, and especially score differences, difficult for laymen to interpret.
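One way to make score differences easier to read is to express them in standard deviation units. The sketch below assumes the original scaling described above (mean 250, standard deviation 50) still applies, which is only a rough guide because the actual spread of scores in any later assessment may differ. The function names are illustrative and are not part of any NAEP tool.

```python
# Original scaling for the 500-point NAEP assessments, as described in the text.
SCALE_MEAN = 250.0
SCALE_SD = 50.0


def scale_to_z(score: float) -> float:
    """Distance of a scale score from the original mean, in standard deviations."""
    return (score - SCALE_MEAN) / SCALE_SD


def points_to_sd_units(point_difference: float) -> float:
    """Express a scale-score difference as a fraction of a standard deviation."""
    return point_difference / SCALE_SD


# A 10-point difference is one fifth of the original standard deviation.
print(points_to_sd_units(10))
```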
Adding to the confusion, as originally designed, the intent was that scores for all grade or age levels assessed would be reported on one, common scale. Per this theory, if a fourth grader scored high enough, he would actually be deemed to have outperformed an eighth grader. A similar situation would apply for a high scoring eighth grader, who might be deemed to have outperformed some students in the 12th grade.
The concept of common scale scoring led a number of researchers to believe that a NAEP score difference of 10 points was roughly equal to an extra year of school. As discussed in the next paragraph, that is not correct.
As the NAEP evolved, evidence began to accumulate that the common scale theory was not working out well in practice. For example, in the 1998 NAEP Reading Assessment, fourth grade students scored 217, eighth grade students scored 264, and 12th grade students scored 291 (See Figure 1.1 in The NAEP 1998 Reading Report Card for the Nation and the States, NCES 1999–500). The difference between the fourth and eighth grade scores is 47 points, but the difference between the eighth and 12th grade scores is only 27 points. Clearly, the concept of common scale scoring has not worked out with real students and real test results. Furthermore, the concept that a 10-point difference on the NAEP is equivalent to an extra year of study isn't supportable, either.
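The arithmetic behind this argument can be checked directly. The sketch below uses only the 1998 reading scores quoted above and assumes four school years separate each pair of tested grades; the function and variable names are illustrative.

```python
# 1998 NAEP reading scale scores quoted in the text, keyed by grade.
scores_1998_reading = {4: 217, 8: 264, 12: 291}


def gain_per_year(low_grade: int, high_grade: int, scores: dict) -> float:
    """Average scale-score gain per school year between two tested grades."""
    return (scores[high_grade] - scores[low_grade]) / (high_grade - low_grade)


print(gain_per_year(4, 8, scores_1998_reading))   # grades 4 to 8
print(gain_per_year(8, 12, scores_1998_reading))  # grades 8 to 12
```

The two spans work out to 11.75 and 6.75 points per year. Neither matches a 10-points-per-year rule of thumb, and the two disagree with each other.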
Issues with NAEP Achievement Level Scores
The NAEP Achievement Level scores are actually derivatives of the scale scores. In general, a team is assembled to determine which “cut scores” on the scale scores correspond to the definitions used for the Achievement Levels. Thus, any problems in the scale scores will be reflected in the Achievement Level Scores, as well.
However, criticism of the NAEP Achievement Level Scores runs deeper. A number of studies have criticized the methodology of setting the cut scores and the implications derived from this scoring system. As a result, the following statement appears in all the latest NAEP Report Cards:
“As provided by law, NCES, upon review of congressionally mandated evaluations of NAEP, has determined that achievement levels are to be used on a trial basis and should be interpreted with caution. The NAEP achievement levels have been widely used by national and state officials.” From: The Nation’s Report Card: Mathematics 2007 (NCES 2007–494)
Pitfalls in Interpreting NAEP Scores
The NAEP is the only currently available testing program that conducts generally equivalent testing in all 50 states; however, this does not make it a “gold standard” for conducting comparisons of state to state academic performance. In fact, the federal assessment has a number of validity issues which create serious pitfalls for simplistic comparisons of scores between states. These pitfalls can also complicate the analysis of a single state’s performance over time.
Many Reports Ignore NAEP Validity Issues
The vast majority of research studies on the NAEP fail even to mention the caveats and potential pitfalls these issues create for their analysis. It is difficult to excuse that oversight because recent NAEP documents, including the NAEP report cards, are now fairly “up front” about at least a portion of the problem.
For example, a short discussion of several key impediments to fair analysis of state NAEP performance is found in all of the 2007 NAEP Report Cards. An example of some of these comments appears on page 7 of The Nation’s Report Card: Mathematics 2007 (NCES 2007–494), where it says,
“Variations in exclusion and accommodation rates, due to differences in policies and practices regarding the identification and inclusion of students with disabilities and English language learners, should be considered when comparing students’ performance over time and across states. While the effect of exclusion is not precisely known, comparisons of performance results could be affected if exclusion rates are comparatively high or vary widely over time.”
Exclusion Rate Differences Can Impact NAEP Comparisons
Bluegrass Institute’s education analyst, Richard Innes, played a key role in bringing the NAEP exclusion rate issue to light in 1999 after the 1998 NAEP Reading Assessment results were released. In that assessment Kentucky experienced one of the largest gains in reading scores (increasing from a score of 212 in 1994 to 218 in 1998) of any participating state.
What that 1998 era NAEP report card did not discuss was that Kentucky had also experienced a very sharp rise in exclusion of students with learning disabilities (up from just 4 percent of the entire raw sample in 1994 to 10 percent in 1998, according to the data available at that time).
Innes conducted some very straightforward analysis (discussed in detail in The Troubling Situation With The 1998 National Assessment of Educational Progress (NAEP) 4th Grade Reading Assessment) that implied somewhere between three and six points of the rise in the Kentucky scores was solely due to sharply increased exclusion of students who would be expected to get very low scores had they actually been allowed to take the NAEP assessment. Innes showed that, rather than experiencing one of the biggest score improvements on NAEP reading between 1994 and 1998, Kentucky actually may have had perfectly flat performance. Obviously, that was a very different conclusion from the one conveyed by a simplistic comparison of the official scores.
Impact of Variations in Rates of Providing Testing Accommodations Unknown
The introduction of testing accommodations in the NAEP for learning disabled students (known as Individual Education Plan (IEP) students in older NAEP reports) and for English language learners (known as English as a Second Language (ESL) in older NAEP reports) has also raised unresolved validity issues. A very interesting discussion of this problem is found in An Agenda for NAEP Validity Research by the NAEP Validity Studies Panel. The panel’s 2002 report discusses 21 different areas where there are important, unanswered questions about the validity of the NAEP. The panel ranks testing accommodations as one of the most critical issues in need of further investigation.
At present, there isn’t compelling research to show whether the accommodations artificially inflate scores for these students.
Demographic Changes Can Severely Impact NAEP Performance
There are more validity issues for the NAEP listed on Page 7 of The Nation's Report Card: Mathematics 2007. Those additional cautions read,
“Changes in performance results over time may reflect not only changes in students’ knowledge and skills but also other factors, such as changes in student demographics, education programs and policies (including policies on accommodations and exclusions), and teacher qualifications.”
Among the items in this second listing, research conducted by the Bluegrass Institute indicates that differing changes in student demographics from state to state can have profound impacts on NAEP performance.
A Kentucky NAEP Demographic Example
Differences in student demographics between state and national samples and between various states greatly impact the interpretation of NAEP results. Consider this example, which covers NAEP eighth grade science scores for Kentucky and across the nation.
Science is Kentucky’s top performance area on the NAEP, at least according to the officially reported scores. In 2005 Kentucky got a NAEP eighth grade science scale score of 153, six points higher than the national average of 147. If we simplistically look no further than these overall average scores, one would conclude that Kentucky is doing considerably better than the rest of the nation in teaching science.
But, consider what happens when a reasonable correction is made for the very different makeup of Kentucky’s student demographics versus the national average student body makeup.
To begin, the proportions of the various races in the national and Kentucky NAEP 2005 eighth grade science assessment are shown in this figure.
Notice that Kentucky’s tested student sample is overwhelmingly composed of White students. Fully 87 percent of Kentucky’s NAEP tested students were White. However, across the nation only 60 percent were from this race.
Also notice that in Kentucky all other races except for Blacks were present in very small numbers, proportions so small that scores could not be reported for them due to sampling size issues. However, across the nation, other races make up a notable proportion of the tested sample, and scores for every group are reported. And, in every case, Whites outperformed the other racial groups reported by the NAEP.
To correct for that problem, Richard Innes, the education analyst at the Bluegrass Institute for Public Policy Solutions, computed an average score across the White and Black results only, but he weighted both the national and state scores using the Kentucky demographic of 87 percent White, and 10 percent Black. Innes ignored all the other racial groups because with Kentucky weighting applied, their scores would be statistically unreliable, just as actually happened for Kentucky’s scores.
This graph shows the results. The officially published scores are shown by the blue bars and the scores with demographic corrections are shown by the dark red bars. Note that, as expected, Kentucky’s scores don’t change (within rounding error) because the weighting stays the same and the other races besides Whites and Blacks were virtually absent in Kentucky.
However, the national average scores undergo a remarkable increase once the demographic correction is applied. Once we consider the very strong differences in the demographic makeup of the student group in Kentucky and across the nation, the impression that Kentucky does notably better than the rest of the nation flip-flops in a hurry. All of a sudden, Kentucky winds up three points behind rather than six points ahead of the national average.
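The reweighting described above can be sketched in a few lines. This is a minimal illustration, not Innes’ actual spreadsheet: it assumes the two weights (87 percent White, 10 percent Black) are renormalized to sum to one, and the national subgroup scores used here are hypothetical placeholders, since the subgroup scores are not given in the text. Substitute the published subgroup scores to reproduce a real comparison.

```python
def reweighted_mean(subgroup_scores: dict, weights: dict) -> float:
    """Weighted mean over the groups listed in `weights`.

    Weights are renormalized to sum to 1 (an assumption of this sketch),
    so groups left out of `weights` are simply ignored.
    """
    total_weight = sum(weights.values())
    weighted_sum = sum(subgroup_scores[group] * w for group, w in weights.items())
    return weighted_sum / total_weight


# Kentucky's 2005 grade 8 science sample shares from the text (percent).
kentucky_weights = {"White": 87, "Black": 10}

# HYPOTHETICAL national subgroup scores, for illustration only (not NAEP data).
national_scores_placeholder = {"White": 160, "Black": 125}

adjusted_national = reweighted_mean(national_scores_placeholder, kentucky_weights)
print(round(adjusted_national, 1))
```

Because the same weights are applied to both jurisdictions, any remaining gap reflects differences in subgroup scores rather than differences in who was tested.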
Thus, especially in cases where Kentucky’s NAEP scores are compared to performance elsewhere or across the nation, failure to even consider the very sharp differences in the student demographics involved leads to seriously inflated impressions of Kentucky performance.
Certainly, simplistic NAEP comparisons for the state must be considered very carefully in light of the example above. Often, those simplistic comparisons lead to very inaccurate impressions.
California Demographic Example
Richard Innes looked at how huge demographic changes in California’s student population since NAEP state assessments began have very likely severely depressed that state’s more recent scores on the NAEP, hiding what may actually be some rather remarkable performance increases (find the full spreadsheet with Innes’ work here).
Innes’ analysis of California’s NAEP student samples shows the state experienced an avalanche of change in its public school student demographics between 1992 and 2007.
In 1992, the state’s demographic makeup for the NAEP was 51 percent White, 8 percent Black, 28 percent Hispanic, and 12 percent Asian American/Pacific Islander. Two other NAEP racial reporting categories of “American Indian” and “Unclassified” each had only a trivial number of students.
By 2007, California’s demographics had shifted dramatically. Now the makeup was only 28 percent White, 7 percent Black, 52 percent Hispanic, and 11 percent Asian American/Pacific Islander. Because all the groups except Asian American/Pacific Islanders score notably below Whites, the impact on California’s overall NAEP average scores was considerable in 2007.
The following table shows two sets of scores that Innes computed to show how considerable that impact was.
The far right column in the table shows scores that are not corrected in any way for demographic changes. These scores agree closely with those actually published for California although Innes’ calculations do not consider scores for NAEP categories of American Indian and Unclassified because the percentages present in California from these groups are insignificant.
The middle column in Innes’ table shows how California’s scores would look if the student demographics in the 1992 NAEP had been constant throughout the rest of the years listed.
Examining the uncorrected scores indicates that California’s reading scale score for all students only improved by 6.1 points. However, if California had not experienced the upheaval in its demographics, the state would have posted a much larger 13.9 point increase in its scale score, more than twice the officially reported improvement. That is a very strong signal that California education was actually making some very impressive improvements, especially after 1998, when a new curriculum had started to take hold in the Golden State.
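The size of that gap is easy to verify from the two figures quoted above. The short sketch below uses only those two numbers; the variable names are illustrative.

```python
# California reading gains reported in the text (scale-score points).
uncorrected_gain = 6.1   # no correction for demographic change
corrected_gain = 13.9    # demographics held at their 1992 proportions

# How many times larger the corrected gain is, and how many points
# of improvement the demographic shift masked.
ratio = corrected_gain / uncorrected_gain
masked_points = corrected_gain - uncorrected_gain

print(round(ratio, 2), round(masked_points, 1))
```

The corrected gain is roughly 2.3 times the uncorrected one, which means about 7.8 points of improvement are hidden in the overall average.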
The California NAEP demographic experience is very different from Kentucky’s. Kentucky’s NAEP Grade 4 reading samples had very little change in demographics between 1992 and 2007. Whites comprised 90 percent of the 1992 Kentucky sample and 84 percent of the 2007 sample.
Thus, although demographic factors were totally beyond the control of California educators, those demographic factors acted to dramatically suppress that state's NAEP score improvement over time. Meanwhile, Kentucky’s NAEP performance faced virtually no such problems.
Clearly, ignoring the difference in demographics between Kentucky and California renders simplistic NAEP analysis between the two states virtually without value.
In summary, despite claims that the NAEP is some sort of “gold standard” for comparing educational performance among the states and for individual states over time, there are a variety of unresolved issues that can render simplistic comparisons of NAEP scores highly unrevealing of the true situation. Readers are advised to be particularly careful about conclusions reached in research that fails to even mention the major NAEP validity issues of exclusion rates, accommodation rates and student demographic changes.
Another Demographic Example – New Mexico Versus Kentucky
New Mexico offers another interesting example of how state to state comparisons of NAEP performance can change dramatically once demographic level data is considered.
This first graph shows the overall 2007 NAEP reading scores for Kentucky and New Mexico. In both cases, Kentucky seriously outscores New Mexico.
That image changes dramatically when we look at disaggregated data, however.
This next graph shows that when we compare Kentucky’s Whites to New Mexico’s Whites for 2007 NAEP performance, in both reading and math New Mexico comes out ahead. The same is true when we do an “apples to apples” comparison of Black only performance.
Why does the situation reverse in this way when we disaggregate Kentucky and New Mexico NAEP data?
The answer is that only 32 percent of the students in New Mexico are White, while Kentucky enjoys one of the largest White student populations of any state in the country. In fact, New Mexico has a large, majority population of Hispanic students, and they tend to score much lower than Whites, as well.
When we only look at overall scores for Kentucky and New Mexico, we are doing a very “apples to oranges” comparison. New Mexico’s student demographics are so dramatically different from Kentucky’s that simply comparing overall scores provides highly misleading impressions about the relative performance of both Whites and Blacks in the Bluegrass State.
By the way, due to its enormous population of Hispanics, many of whom are still learning English, New Mexico’s student population has an even higher rate of poverty than Kentucky’s, based on students enrolled in the federal school lunch program. Here are the poverty rates from the NAEP 2007 grade four math assessment.
Poverty Rates (from Federal School Lunch Data)
So, the standard Kentucky excuse doesn’t work for a Kentucky to New Mexico comparison, either.
Statistical Sampling Issues Complicate State Comparisons
There have been a number of inappropriate attempts to rank Kentucky’s NAEP performance against performance of the other states. The NAEP scores are not nearly accurate enough to support these uninformed efforts.
Prime examples of simplistic and inappropriate state to state ranking of Kentucky’s NAEP performance can be found in a series of poorly performed state rankings conducted by the Kentucky Long Term Policy Research Center (now defunct), such as those found here and here.
The Prichard Committee for Academic Excellence also engages in inappropriate rankings of Kentucky’s NAEP performance with its ‘Top 20 by 20’ program in publications such as found here.
Such simplistic rankings of student NAEP scores totally disregard the fact that the NAEP scores have statistical sampling errors. Because of those errors, it is possible for a state to have a higher listed score than another state, but there is no way to be certain that the first state truly outperforms the second one. That makes simplistic ranking such as that from Prichard and the Kentucky Long Term Policy Research Center hopelessly inaccurate.
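To see why a higher listed score does not guarantee a real difference, consider a simplified two-sided test on the difference between two independent state averages. This is a sketch under stated assumptions: the scores and standard errors below are hypothetical, the 1.96 cutoff corresponds to a conventional 95 percent confidence level, and NAEP’s own significance procedures may differ in their details.

```python
import math


def significantly_different(score_a, se_a, score_b, se_b, critical_z=1.96):
    """Simplified two-sided z-test for two independent sample means.

    Returns (is_significant, z), where z is the score difference divided
    by the standard error of that difference.
    """
    se_difference = math.sqrt(se_a ** 2 + se_b ** 2)
    z = (score_a - score_b) / se_difference
    return abs(z) > critical_z, z


# HYPOTHETICAL scores and standard errors, for illustration only.
different, z = significantly_different(218, 1.5, 215, 1.4)
print(different, round(z, 2))
```

In this hypothetical case the first state is listed three points higher, yet the difference is smaller than the sampling error can resolve, so the two states cannot be ranked with confidence.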
While such ranking schemes don’t work, it is possible to use the NAEP Data Explorer to generate interesting national maps that show those states with scores that are statistically significantly higher, lower and the same as Kentucky. Here are some I assembled that will give you some idea about how Kentucky really shapes up based on the NAEP.
Mapping Kentucky’s Real Performance on NAEP
Mapping Whites Across the Nation in 2009
This first graphic shows how whites in other states performed against Kentucky’s white eighth grade students on the 2009 NAEP grade 8 math assessment. It was assembled with the NAEP Data Explorer web tool.
States that scored statistically significantly higher than Kentucky are shaded green. Those that tied us are shaded tan, and the lone state that had a statistically significantly lower score, West Virginia (no, NOT Mississippi), is shaded in salmon color.
In particular, note that whites in many other Southern states did better than Kentucky’s whites. This is important, because whites comprise about 85 percent of all public school students in Kentucky.
Mapping Black Performance in 1990 and 2009
The next two maps examine the performance of Kentucky’s black students. The first one covers the first year of grade 8 math testing with the ‘State NAEP,’ which occurred in 1990. It only shows those states that reported scores for blacks that year.
NAEP wasn’t mandatory in 1990, and some states didn’t participate. Also, in some states, blacks comprise too small a group for the NAEP to collect statistically valid scores, so scores are suppressed. The states that didn’t report black scores are shown in light blue.
The second of the two maps with black data shows similar results for 2009. Again, only states that had scores in 1990 are considered for consistency.
Because blacks comprise a relatively small population in a number of states, even when scores are reported, the statistical sampling errors in NAEP are pretty large. Thus, many states don’t have large enough differences in scores from Kentucky for us to definitely declare the scores are different with a high level of confidence. So, many states are shown in tan in both of the maps.
But, there are differences. In 1990, no state had black scores statistically significantly higher than Kentucky. By 2009, blacks in 8 other states were doing better than the blacks in the Bluegrass State.
In 1990, 8 states definitely had lower scores for blacks than Kentucky. By 2009, that number declined to just 5 states.
Thus, NAEP tells us that blacks in Kentucky lost ground compared to their counterparts in other states that participated in the NAEP in 1990.
Maps for Students in Poverty
The next two maps are similar to the previous ones with one, important change. Both only show performance for students who qualify for the federal school lunch program, a commonly used measure of poverty in school analysis.
This first map is for poor whites only. Again, only West Virginia definitely has lower scores than Kentucky’s poor whites. A number of Southern states, such as North and South Carolina and Texas, definitely had better white performance than the Bluegrass State, and whites in MANY other states also definitely did better.
This last map shows NAEP grade 8 math performance for poor black students only.
Only 3 listed jurisdictions, Michigan, Alabama and Washington, DC, definitely scored lower. Three states, Texas, Virginia and Massachusetts, definitely scored higher. The NAEP just isn’t sensitive enough to detect any difference between Kentucky’s poor blacks and poor blacks in all the other states shown in tan.
Which brings us back to the main point. While these maps have value, they also provide good evidence that simplistic state to state ranking with the NAEP simply isn’t valid. Those who do such rankings are totally ignoring the math of statistics.