# CATS Academic Test Inflation

Since the 1999 inception of Kentucky’s public school Commonwealth Accountability Testing System, or CATS, serious concerns have been raised about the trustworthiness of its academic scores.

## Proficiency Rate Gaps with the NAEP

A series of Bluegrass Institute reports (find them by searching http://www.bipps.org for the keyword “CATS”) has discussed various sets of evidence that the academic tests in the CATS program, properly known as the Kentucky Core Content Tests, or KCCT, have low grading standards. This is now very obvious for the key NCLB subjects of reading and mathematics, where the divergence between the proficiency rates reported by CATS and by the National Assessment of Educational Progress (NAEP) has steadily widened, as shown in Figure 1.

The red bars in Figure 1 show the combined reported percentage of students scoring “Proficient” and “Distinguished” on CATS, while the blue bars show the percentage of students who scored “Proficient or More” according to the NAEP. The CATS data comes from the Briefing Packet, State and Regional Release, Commonwealth Accountability Testing System (CATS) Accountability Cycle 2006 and the 2007 version of the same document. The CATS percentages of “Proficient” and “Distinguished” from those reports were summed to develop the overall proficiency rate numbers in Figure 1. The NAEP data comes from the link shown in Figure 1.

The years shown in Figure 1 were selected to run from the first year of NAEP testing following the 1999 launch of CATS up to the most recent NAEP testing year of 2007.

It is clear in Figure 1 that while CATS has reported notable progress in fourth grade reading proficiency, the NAEP reports Kentucky’s fourth grade reading performance has been virtually flat. In consequence, the proficiency rate gaps between CATS and NAEP grew substantially over the half-decade of data shown. Because the national assessment is very carefully constructed to maintain scoring accuracy over time, and because there isn’t as much pressure to inflate scores on the national assessment as there is on Kentucky’s high-stakes CATS, it is reasonable to interpret Figure 1 as compelling evidence that the CATS scoring has indeed been more and more inflated over time.

It should be noted that the worst NAEP to CATS gap trend isn’t for fourth grade reading. Rather, it is for eighth grade reading, as shown in Figure 2, which relies on the same data sources as Figure 1. Not only did the proficiency rate gap between NAEP and CATS grow from 24 to 38 points during the time shown, but while CATS was posting a 10 point increase in proficiency rates, Kentucky’s reading proficiency rate on the NAEP actually made a statistically significant decline from 32 to 28 percent.

If you would like to see graphs of the NAEP to CATS proficiency gaps for mathematics and writing, check out slides 9, 14 and 15 in this PowerPoint Presentation. You will see that the trends over time are similar to those shown in Figures 1 and 2, with the CATS reporting significantly higher proficiency rates than the NAEP reports as of 2007 for all subjects of math, reading and writing.

## NAEP Ruler Also Shows Inflation in CATS Over Time

Another way to explore the grading inflation in CATS is to use the “NAEP Ruler” developed by Richard Innes, the education analyst for the Bluegrass Institute. The NAEP ruler measures how the grading rigor in CATS compares to the NAEP’s scoring categories of “Proficient” and “Basic,” the next-lower NAEP achievement level. NAEP “Basic” is somewhat similar to the score of “Apprentice” in the CATS system.

A detailed explanation of the “Ruler” can be found in CATS in Decline: Federal Yardstick Reveals Kentucky’s Testing Program Continues to Deteriorate, but a quick explanation is provided here using Figure 3.

### Constructing a NAEP Ruler

To construct a NAEP ruler for a given assessment year and subject, the percentages of Kentucky students who score at the “Proficient” or above and “Basic” or above levels on the NAEP are plotted on the same equal-interval scale. In our example, these two percentages are shown by the green and yellow arrows. Then, a new scale running from 0 to 100 is added, as shown by the NAEP Ruler section in Figure 3, with the “Basic” or above percentage at zero and the “Proficient” or above percentage at 100. Next, the percentage of students who score “Proficient” or above on CATS is added, as shown by the blue arrow. If the blue arrow falls close to the green arrow, then the NAEP Rigor Ratio of CATS to NAEP is relatively high, and will be close to 100. On the other hand, if CATS is graded much more easily than the NAEP, then the NAEP Rigor Ratio will be fairly low. In our example, the Rigor Ratio is only about 32.

The key point to remember is that a NAEP Rigor Ratio near 100 indicates CATS scoring is fairly comparable in difficulty to the respected NAEP. A Rigor Ratio near zero, or even below zero, indicates CATS is using very watered-down grading standards.
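The ruler construction described above can be sketched in a few lines of Python. This is only an illustration: the function name `naep_rigor_ratio` is hypothetical, the anchoring (NAEP “Basic” or above at 0, NAEP “Proficient” or above at 100) is inferred from the description here rather than taken from the detailed report, and the percentages in the usage lines are made-up values chosen to land near the example ratio of 32.

```python
def naep_rigor_ratio(naep_proficient_pct, naep_basic_pct, cats_proficient_pct):
    """Place the CATS proficiency rate on a 0-100 ruler anchored to NAEP.

    Assumed anchoring (inferred from the text, not from the original report):
    NAEP "Basic or above" maps to 0, NAEP "Proficient or above" maps to 100.
    """
    span = naep_basic_pct - naep_proficient_pct
    if span <= 0:
        raise ValueError("NAEP Basic-or-above rate must exceed Proficient-or-above rate")
    # The further the CATS rate sits above the NAEP Proficient rate,
    # the lower the ratio; past the NAEP Basic rate it goes negative.
    return 100.0 * (naep_basic_pct - cats_proficient_pct) / span


# Hypothetical percentages, for illustration only.
print(naep_rigor_ratio(30, 80, 64))  # CATS rate between the two NAEP rates
print(naep_rigor_ratio(30, 80, 90))  # CATS rate above NAEP Basic: negative ratio
```

Under this reading, a CATS proficiency rate equal to the NAEP “Proficient” rate gives 100, one equal to the NAEP “Basic” rate gives 0, and anything higher than the NAEP “Basic” rate gives a negative value.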

### NAEP ratio trends in elementary schools

With that explanation complete, Figure 4 shows how NAEP Rigor Ratios have trended over time for elementary school reading in Kentucky.

As you can see, the grading rigor for elementary school reading in CATS started out very low in the early years, with CATS scoring students “Proficient” when their actual performance was very close to what the NAEP defines as “Basic,” which is only partial mastery of the subject. Then, things got even worse, so that by 2007 what CATS calls “Proficient” is notably below the level that NAEP designates as only partial mastery of the material. The NAEP rulers for fourth grade math and eighth grade math and reading don’t show quite such loose standards in CATS, but in all cases it is clear that using NAEP as a ruler shows CATS scoring across the board started getting easier at least by 2005, if not earlier. Thus, the CATS scoring has not been stable over time.

You can see more graphs like Figure 4 covering math, reading and science for both fourth and eighth grades by looking at this document.

## Comparing CATS to Itself Over Time

Even comparing CATS to itself over time shows strong inflation caused by the 2007 resetting of scoring standards.

Consider the growth in the CATS proficiency rates shown by the red bars in Figure 5. It is obvious that the middle school math CATS proficiency rate took a huge jump in 2007. That jump was caused by a resetting of the CATS scoring standards in 2007, which in this case obviously watered down the rigor of CATS.

It is possible to analyze trends in CATS scores between 1999 and 2006 under the initial CATS scoring program using a standard technique called linear regression. That analysis allows us to determine the average annual change in the scores.

Then, that average annual change during the “Old CATS” period can be compared to the one-year score jumps that occurred between 2006 and 2007 as the state switched over to the new CATS scoring standards. When that is done, the inflation in the “New CATS” is very evident.

### Actual vs. Expected change

In Figure 6, we compare the actual change in the individual subject CATS academic indexes between 2006 and 2007 to the change that would be expected from the trend in the scores from 1999 to 2006. If the 2006 to 2007 change exactly matches what the 1999 to 2006 history predicts, the red bar will just touch the 100 percent line. However, if the 2006 to 2007 change is more than expected, then the red bar and the associated percentage extend above the expected level of 100 percent.
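The comparison can be sketched as follows, assuming an ordinary least-squares trend line fitted to the 1999 to 2006 index values. The function names and all of the index values below are hypothetical; they were chosen only so that the result matches the 330 percent scale of the bars in Figure 6, and they are not actual CATS data.

```python
def least_squares_slope(years, scores):
    """Average annual change: slope of the ordinary least-squares line."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(scores) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, scores))
    sxx = sum((x - mean_x) ** 2 for x in years)
    return sxy / sxx


def actual_vs_expected_pct(years, scores, new_score):
    """One-year change after the last trend year, as a percent of the trend."""
    expected_change = least_squares_slope(years, scores)
    if expected_change == 0:
        raise ValueError("flat trend: percent of expected change is undefined")
    actual_change = new_score - scores[-1]
    return 100.0 * actual_change / expected_change


# Hypothetical "Old CATS" index values for 1999-2006 (not real data).
old_years = list(range(1999, 2007))
old_index = [50.0, 52.5, 55.0, 57.5, 60.0, 62.5, 65.0, 67.5]

# Hypothetical 2007 value under the reset scoring standards.
pct = actual_vs_expected_pct(old_years, old_index, 75.75)
print(f"{pct:.0f} percent of the expected change")
```

A result of 100 means the 2006 to 2007 change matched the earlier trend; values above 100 mean the one-year change was larger than the history predicts.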

As you can see in Figure 6, the inflation in the resetting of the CATS middle school scoring scales was extensive. For example, in reading the actual change was not the expected 100 percent but 330 percent of the predicted change. That is 230 percent more than predicted, so the excess alone is more than twice the change that the history for middle school reading under the “Old CATS” would predict.

It should be noted that in 2007 the Writing score was separately reported for the Writing Portfolios and the On-Demand Writing Assessment elements. The scores for both were averaged together to create the input data for the total Writing bar in Figure 6.

Overall, the middle school Total Academic Index, which is what is factored into each school’s final CATS Accountability Index, grew by 401 percent more than predicted. Figure 6 makes it clear that the resetting of CATS scores in 2007 produced a notable further watering down of standards in Kentucky’s school assessment program.

The inflation in the elementary schools’ CATS scoring in 2007 looks very similar to the middle school inflation discussed above. However, the high school change in On-Demand Writing scoring was not ready for 2007 and didn’t take effect until 2008. Thus, Figure 7 has been modified somewhat to include the writing change from 2007 to 2008 to provide a better representation of how the CATS scoring changes really affected the high schools. Again, as with Figure 6, the Writing bar in Figure 7 is a total writing score developed by averaging the Writing Portfolio and separate On-Demand Writing scores.

You can see a similar presentation covering the elementary schools by looking at slide 27 in this PowerPoint Presentation. Please note that slide 29 in this PowerPoint does not accurately reflect the change in On-Demand Writing scoring for high schools in 2008 as discussed above.

Overall, it is clear that scoring inflation caused by the standards resetting in 2007 destroyed the old CATS trend lines.

## CATS Compared to EPAS

EPAS is a new set of coordinated tests from the ACT, Incorporated that are given to all Kentucky students in the eighth, tenth and eleventh grades. The tests and associated grades of administration are:

- Grade 8 – EXPLORE
- Grade 10 – PLAN
- Grade 11 – ACT (the College Entrance Test)

The EPAS system reports scores in several different ways, but the scoring scheme of interest here is related to the new ACT “Benchmark Scores,” which indicate a student has about a 75 percent chance of earning a “C” or higher and a 50 percent chance of earning a “B” or higher in the related freshman college course. For example, if a student scores at or above the Benchmark for math on an EPAS test, he or she is likely to pass freshman algebra with an acceptable, if not outstanding, grade at a typical two-year or four-year college.

The Benchmark scores for the ACT were empirically derived from a survey of colleges and universities that use the ACT. They are not based solely on educator opinion – the way that the CATS scoring scheme is set.

Several other points are worth mentioning. The ACT has a rich history of cooperation with business and industry in determining the skills needed for better-paying, non-college-track skilled, technical and semi-skilled jobs. The ACT recently surveyed its business connections and learned that most better-paying non-college jobs now require about the same skills that an entering college freshman needs. Thus, it is reasonable to consider the ACT Benchmarks a useful, though certainly not fully revealing, indicator of student preparation for life beyond high school.

### “Proficient” CATS vs. EPAS Benchmark

Now, here are some comparisons of the percentage of students scoring “Proficient” or more in several CATS subjects and the percentages of students that scored at or above the Benchmark score in the related EPAS test. Figure 8 shows how middle school students in Kentucky compare on these two assessments. CATS data comes from the Briefing Packets discussed earlier. The ACT results come from Bluegrass Institute analysis of the cumulative percentage chart (Table 1a) versus the Benchmark Cut Scores in the ACT, Incorporated’s “EXPLORE PROFILE SUMMARY REPORT, KENTUCKY EPAS, 2007 − 2008, CODE 00000018, TEST DATE: 09−2007, NAT’L NORM GROUP: Fall 8th Grade,” which is not available on line.

Notice that in both math and reading, the results are very similar to what we saw earlier on the NAEP. CATS proficiency rates are about double the percentages of students who reach the Benchmarks. In science, the situation in CATS is even more inflated.

Figure 9 shows the same data for high schools. Again, CATS scores come from the earlier mentioned briefing packets. The derivation of the proportion of students scoring above the PLAN Benchmarks was conducted by the Bluegrass Institute using the cumulative percentage chart (Table 1a) in “PLAN, Profile Summary Report, KENTUCKY, (STANDARD & TIME-EXTENDED), Code 8818ST, 2007 – 2008,” which is also from the ACT, Incorporated and also is not on line.

Note that with the exception of reading, the results are similar to the eighth grade results. In reading, the high school students did a little better, but not much.

## Writing Portfolio Audit Shows Strong Inflation

Additional evidence of inflation in CATS comes from annual audits of the scoring of the Writing Portfolios part of the assessment. These audits have consistently reported significant grading errors, especially for the two highest score categories of “Proficient” and “Distinguished.”

To begin, look at Figure 10, which shows results of the 2007-2008 Writing Portfolio audit for middle schools. This figure comes from page 19 in the “Kentucky Commonwealth Accountability Testing System 2007-2008, Writing Portfolio Audit Report,” from the Kentucky Department of Education, which is not available on line. Figure 10 explains the steps needed to read the audit results.

The original scoring is shown by the rows, so to determine how many students were originally scored “Distinguished” by their teachers, (1) enter the table on the left at the row marked “Distinguished.” Then, (2) go all the way across horizontally to the “Total” column to see that in the original scoring, 246 students were graded “Distinguished.” Next, to see how many students retained that “Distinguished” score after the audit, (3) read back to the left horizontally to the column labeled “Distinguished.” Notice that only 18 of the 246 students, just 7.32 percent, retained this top score after the audit. To see how many of the original “Distinguished” scores were downgraded to only “Proficient,” (4) read further left to the column labeled “Proficient” to see that many students in this group, 160 of the original 246, or 65.04 percent, were downgraded.

In fact, if you continue to read to the left, you will see that 64 students, or 26.02 percent, were actually downgraded by two score levels to “Apprentice.” Thus, there actually were more true “Apprentice” students in the original group of “Distinguished” students than students who actually deserved the top score. You can see tables similar to those in Figure 10 covering all school levels in slides 62 to 64 in this PowerPoint Presentation.
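The row percentages quoted above can be reproduced with a short calculation. The counts are the ones given in the text for the originally “Distinguished” row of the middle school audit; the function name `row_percentages` is just an illustrative choice, and the text does not say where the remaining four students of the 246 ended up.

```python
def row_percentages(audited_counts, row_total):
    """Share of an original score row that landed in each audited category."""
    return {
        category: round(100.0 * count / row_total, 2)
        for category, count in audited_counts.items()
    }


# Originally "Distinguished" middle school portfolios, per the audit table.
original_distinguished = 246
audited = {"Distinguished": 18, "Proficient": 160, "Apprentice": 64}

shares = row_percentages(audited, original_distinguished)
for category, pct in shares.items():
    print(f"{category}: {pct} percent")

# Students in this row not accounted for by the three categories above.
print("Not broken out in the text:", original_distinguished - sum(audited.values()))
```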

How much inflation occurred in the portfolio scores? Under the scoring scheme in CATS, the various scores of “Novice,” “Apprentice,” “Proficient” and “Distinguished” receive weights of 0, 60, 100 and 140 respectively. Figure 11 will help explain how the original scores for the middle school sample in Figure 10 would be scored.

### Calculating the CATS Academic Index Score

The first step in calculating the final CATS Academic Index Score for the writing portfolios is to multiply the number of students in each scoring category by the weight that scoring category is assigned in CATS. For example, the 68 students originally scored as “Novice” in the portfolio audit sample get a weight of zero, for a product of zero. Since “Apprentice” is weighted as 60, the product for these students is 83,040. Since an Incomplete portfolio is a non-performance, these should also receive a zero score.

Next, the total number of students and the total sum of the products are computed. That sum of the products is then divided by the total number of students to calculate the Academic Index. Figure 12 shows a similar table for the audited scoring.
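The two steps can be sketched as a weighted average. The weights are the ones stated above, with Incomplete portfolios scored as zero. In the counts, the Novice and Distinguished figures come from the text and the Apprentice figure of 1,384 is inferred from the stated product (83,040 divided by 60). The Proficient and Incomplete counts are made-up placeholders, so the printed index is illustrative only and is not the value shown in Figure 11.

```python
# CATS weights as described in the text; Incomplete counts as non-performance.
WEIGHTS = {
    "Novice": 0,
    "Apprentice": 60,
    "Proficient": 100,
    "Distinguished": 140,
    "Incomplete": 0,
}


def category_products(counts):
    """Step 1: number of students in each category times that category's weight."""
    return {category: WEIGHTS[category] * n for category, n in counts.items()}


def academic_index(counts):
    """Step 2: sum of the products divided by the total number of students."""
    total_students = sum(counts.values())
    if total_students == 0:
        raise ValueError("no students in sample")
    return sum(category_products(counts).values()) / total_students


sample_counts = {
    "Novice": 68,           # from the text
    "Apprentice": 1384,     # inferred: 83,040 / 60
    "Proficient": 1000,     # placeholder, NOT from the audit report
    "Distinguished": 246,   # from the text
    "Incomplete": 2,        # placeholder, NOT from the audit report
}

print(category_products(sample_counts)["Apprentice"])  # product for Apprentice
print(round(academic_index(sample_counts), 2))         # illustrative index only
```

Running the same function on the audited counts and subtracting the two results gives the kind of index difference discussed for the original and audited scoring.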

Note that the audited Academic Index is 16.62 points lower than the original scoring awarded by teachers. That notable difference can have an impact on whether a number of Kentucky schools were actually in the “Meets Goal,” “Progressing” or “Assistance” category in CATS. Such scoring error therefore represents unacceptable inflation in the original portfolio scoring.

Figure 13 summarizes the Bluegrass Institute’s calculated differences in the original academic index for writing portfolios and the index that results from the audit-assigned scores. The least error occurs in high schools, while middle schools have much higher errors in the CATS writing score for portfolios due to the clear inflation in teacher scoring.

## Portfolio Scoring Getting Worse Over Time for Most Schools

Figure 14 summarizes Bluegrass Institute analysis of the writing portfolio accuracy over time by school level. This is developed from the “Percent of Exact Agreement” figures published with the tables in the audits for 2005-2006, 2006-2007 and 2007-2008 from the Kentucky Department of Education (again, none on line).

Inspection of Figure 14 makes it clear that writing portfolio scoring accuracy was worse in elementary and middle schools during the last year of this three-school-year period than at the beginning. Only the high schools have made some modest improvement in scoring accuracy, but the 2007-2008 percentages indicate that about one in three high school students is getting the wrong portfolio score, and the earlier discussion in this Wiki post shows that error is inflating the scores. The situation is worse in the lower level schools.

## Summary

- CATS math, reading, and writing proficiency rates are now MUCH higher than rates reported on the NAEP.
- CATS reading, math and science proficiency rates are MUCH higher than the percentages of students reaching the EPAS Benchmark scores.
- CATS scoring versus the NAEP has been getting easier over time in math and reading.
- Resetting CATS scoring in 2007 resulted in notable new score inflation, with the most significant impact on overall academic indexes occurring in middle schools.
- Writing portfolio scoring is clearly problematic, with significant disagreements continually appearing between originally awarded scores and the results of portfolio audits.