Food for Thought: The Lack of Validity and Reliability with Grades
- Heather Lyon 
- Apr 3, 2024
- 7 min read
Updated: Apr 4, 2024
Hello,
Three years ago, in the post, “Making (Up) the Grade,” I fired a shot across the bow when I asserted the grades students get in school are made up. I ended the post by saying:
Report cards and grades in general are not a science—they are an art. They are subjective. They are arbitrary and capricious. My report card grades had some reflection of my efforts and knowledge, but the actual numbers out of 100 were wildly invalid and unreliable. This is why my report card average of 96.66 meant nothing when compared to 96.33 since those numbers actually had as much (and I would argue more) to do with how the teachers graded me, rather than a true reflection of my knowledge of the content.
My point is as apt today as it was when I wrote it. Grades on individual assignments that are used to determine a student’s overall grade on report cards are invalid and unreliable.
Valid and Reliable
To better understand validity and reliability, I’m going to take us out of the education realm and take you into my kitchen. Since I love to cook and bake, I use my oven a lot. Here’s the thing about my oven…it’s reliable, but not valid.
- Reliability is the extent to which the outcomes are consistent when repeated. 
- Validity is the extent to which the results measure exactly what you want to measure. 
I know every time I use my oven that if I set the temperature according to the recipe, the food will be in the oven longer than the recipe states. The reason is clear. My oven runs cold. The number I set it to is not the actual temperature inside the oven. So, if the recipe says to set the oven to 350 degrees and bake for 15 minutes, then if I set the oven to 350 degrees, the food will need at least 20 minutes to bake.
Below is a table that represents all of the possible combinations of validity and reliability.
| Invalid and Unreliable | Invalid and Reliable |
| The oven is never the right temperature. When set at 350 degrees, a thermometer would never read 350 degrees. | The oven is never the right temperature but is consistently off by a set amount. When set at 350 degrees, a thermometer would always read 325 degrees. |
| Valid and Unreliable | Valid and Reliable |
| The oven is sometimes the right temperature. When set at 350 degrees, a thermometer would occasionally read 350 degrees. | The oven is always at the right temperature. When set at 350 degrees, a thermometer would always read 350 degrees. |
Here is a visual representation of the four different combinations of validity and reliability.
As you can see, the not valid and not reliable attempts are all over the place with nothing on target. The reliable attempts are clustered together, but not on target. There are some attempts that hit the target in the valid but not reliable quadrant. Finally, when the attempts are valid and reliable, they are all on target.
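For readers who like to see the idea in numbers, here is a minimal sketch of the four combinations. Everything in it is an assumption made for illustration: the thermometer readings are invented, the 5-degree tolerance is an arbitrary cutoff, and treating "valid" as "on target on average" and "reliable" as "little variation between readings" is one simplified way to frame the oven example, not a formal definition.

```python
from statistics import mean, pstdev

TARGET = 350  # the temperature the oven is set to in the example above


def describe(readings, target=TARGET, tolerance=5):
    """Label thermometer readings as (validity, reliability).

    tolerance is an assumed cutoff in degrees, chosen only for illustration.
    """
    bias = mean(readings) - target  # how far off target the oven is on average
    spread = pstdev(readings)       # how much the readings vary from use to use
    validity = "valid" if abs(bias) <= tolerance else "invalid"
    reliability = "reliable" if spread <= tolerance else "unreliable"
    return validity, reliability


# Hypothetical readings, invented for this sketch
ovens = {
    "all over the place": [290, 340, 310, 400],
    "always 325": [325, 325, 325, 325],
    "sometimes 350": [330, 370, 350, 350],
    "always 350": [350, 350, 350, 350],
}

for name, readings in ovens.items():
    print(name, describe(readings))
```

My own oven would land in the "always 325" row: consistent, but consistently wrong.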
Getting back to grading, to be valid and reliable, we would need to be able to answer yes to the following two questions:
- Do the grades take into account student content knowledge only? 
- Would the grade be the same grade given by any teacher? 
Let’s address each of these separately first before linking them together.
Do the grades take into account student content knowledge only?
Content knowledge refers to the intended learning of the standards. Here is an example of a New York State standard for English: “Analyze how and why individuals, events, and ideas develop and interact over the course of a text.” Below are just three possibilities of how a student might demonstrate their knowledge of the content relative to this standard:
- Write an essay or response that outlines the key individuals, events, and ideas. They should then delve into how these elements interact and develop, using textual evidence to support their analysis. 
- Pinpoint specific quotes, passages, or details from the text that support their claims about interactions. 
- Explain how events trigger reactions in individuals or the development of ideas. 
Now, let’s imagine the student did the assignment but turned it in late. If the child’s grade on the assignment is penalized for being late, then the grade has not taken into account only the student’s knowledge of the content. This happens all the time. If points are taken off for late work, then the score partly reflects a student’s behaviors toward learning, not just the child’s knowledge of the content. In this case, a student might have scored a 100% on the assignment, but the score is reduced from 100% to 80% due to the penalty. This reduced score, unfortunately, does not reflect the child’s knowledge of the content and gives a false negative regarding what the student knows.
On the opposite end, my youngest son has a teacher who awards five points on test grades if the students have their parents sign the test. In other words, my signature, which obviously isn’t related to any standard, is worth 5% of my son’s grade. If we play this out, one student who scored a 65 might, for any number of reasons, not get the test signed. Another student who scored a 60 could get their test signed. These two students now have the same score in the grade book, yet they certainly do not have the same knowledge of the content. If a 65 is passing, the second student has now passed a test they otherwise would have failed. This inflated score, unfortunately, does not reflect only the child’s knowledge of the content and gives a false positive regarding what the student knows.
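The arithmetic in these two examples can be sketched in a few lines. This is only an illustration: the function name `recorded_score` and the cap at 0 to 100 are my assumptions, while the 20-point late penalty, the 5-point signature bonus, and the 65 passing mark come from the examples above.

```python
PASSING = 65  # passing mark used in the signature example


def recorded_score(content_score, late_penalty=0, signature_bonus=0):
    """Return the gradebook score after non-content adjustments.

    The cap at 0..100 is an assumption made for this sketch.
    """
    adjusted = content_score - late_penalty + signature_bonus
    return max(0, min(100, adjusted))


# Late work: full content knowledge, recorded as 80 (false negative)
late = recorded_score(100, late_penalty=20)

# Signed test: a 60 is recorded as 65; an unsigned 65 stays 65 (false positive)
signed = recorded_score(60, signature_bonus=5)
unsigned = recorded_score(65)

print(late, signed, unsigned)
print("Signed student passes:", signed >= PASSING)
```

In both cases the number in the grade book differs from the number that describes what the student knows, and nothing in the grade book shows which part came from where.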
Would the grade be the same grade given by any teacher?
It’s easy to see how performance-based assignments and product-based assignments (like essays, artwork, etc.) are subjective and present a high risk for wildly different scores. As they say, “Beauty is in the eye of the beholder.” In this sense, understanding the challenges related to interrater reliability (consistency) seems obvious.
Less obvious, and therefore more insidious, is the ability of teachers to create their own system for grading. For example, imagine there is a team of teachers who co-create and administer the exact same assignments. No matter which teacher the students have, they will all have the same tasks and tests. This is wonderful and I would not only support this type of collaboration, I would create systems for this to happen. If, however, the teachers do not have the same approach to grading, then the level of intended consistency will never be realized. Here is what that could look like in action.
Imagine there are six teachers on this team, but each teacher weights the work differently. Which teacher do you want?

Now, let’s look at the work of three different students.

If we apply the six different grading policies to the three different students, you can see how wildly the students’ scores for the course change. What’s critical to remember when looking at the grades is that each student’s work is exactly the same under every teacher; what changed is the way the teacher calculated the grades. Thus, the student’s report card grades will be a function of the teacher’s beliefs about grading and not the student’s knowledge of the content.
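To make the calculation concrete, here is a minimal sketch of how weighting alone moves a course grade. The scores and the three weighting policies below are hypothetical numbers I made up for this sketch; they are not the six teachers or three students from the example, and the policy names are labels of my own.

```python
# Hypothetical scores for one student on three tasks and one test
scores = {"task_1": 95, "task_2": 90, "task_3": 85, "test": 60}

# Hypothetical weighting policies; each set of weights adds up to 1
policies = {
    "equal_weights": {"task_1": 0.25, "task_2": 0.25, "task_3": 0.25, "test": 0.25},
    "test_heavy": {"task_1": 0.10, "task_2": 0.10, "task_3": 0.10, "test": 0.70},
    "tasks_heavy": {"task_1": 0.30, "task_2": 0.30, "task_3": 0.30, "test": 0.10},
}


def course_grade(scores, weights):
    """Weighted average: each score multiplied by its weight, then summed."""
    return sum(scores[name] * weights[name] for name in weights)


# Same work, three different course grades
grades = {name: round(course_grade(scores, w), 1) for name, w in policies.items()}

for name, grade in grades.items():
    print(f"{name}: {grade}")

swing = max(grades.values()) - min(grades.values())
print("Swing between policies:", swing)
```

With these made-up numbers, the same four pieces of work earn anything from a 69 to an 87, depending only on whose weighting is applied.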

Taken together, it is clear and frightening how invalid and unreliable student grades can be. Between including aspects of student behavior that are unrelated to a student’s content knowledge and the impact that an individual teacher’s grading approach can have on a student’s grade, we should all be concerned about not just how we have historically calculated student grades, but also how those grades are used. After all, Student B in the example above is going to be unable to participate in school sports, unable to attend school events, etc., if Student B has Teacher 6. That same student is almost on the honor roll if they have Teacher 1 or Teacher 3. This is not just a swing of twenty percentage points; it’s a difference in extrinsic consequences like detentions or scholarships and in intrinsic ones like how a student feels about school and their identity as a learner. What’s even more frightening is that this example only shows three tasks and one assessment. Since teachers often have many more tasks and assignments in each marking period, the range of outcomes can be even wider.
Food for Thought
Like my oven, which consistently runs cold, the current grading system is consistently off the mark. While it might be reliable in a sense (you can generally expect a certain level of teacher bias or subjectivity), it's not valid in accurately reflecting a student's true understanding. The 96.66 versus 96.33 debate becomes irrelevant when both numbers are potentially off the mark, influenced more by the teacher's "grading temperature" than the student's actual knowledge.
The ideal scenario, like the perfect oven temperature, would be a system that's both valid and reliable. Grades would then truly represent a student's grasp of the material, free from extraneous factors like penalties for late work or inconsistencies in teacher expectations. This doesn't mean throwing out grades entirely, but rather striving for a system that's more like a well-calibrated oven–one that consistently and accurately reflects what's inside. By acknowledging the limitations of the current system and working towards improvements, we can ensure grades become a more reliable gauge of student learning, a tool that empowers rather than misleads.
Consider this letter the appetizer, since this is just the first of a multi-part series on grading that I’ll be sharing with you over the next several weeks. I hope I have whetted your appetite for the upcoming courses.
~Heather
P.S. Jennifer Borgioli Binis is one of the smartest people I know. Jenn is president of Schoolmarm Advisors. A former middle school special education teacher and professional development provider, she has facilitated large- and small-scale audit and design projects with and for schools, districts, and states.
Jenn recently started a new newsletter called “Dissertate,” which is my Catch of the Week. You can subscribe to the newsletter by clicking here. Subscribing to Jenn’s newsletter will achieve two wonderful outcomes. First, you will benefit from Jenn’s brilliance. Even if the content of the newsletter is not something you would normally be interested in, subscribe anyway and send the link to those who would be interested. Second, you will help Jenn achieve her goal of garnering a larger social media footprint to assist her in getting her next book published. Please join me in helping Jenn achieve this goal.
P.P.S. Please remember to...
Like and share this post
Check out other posts
Subscribe to www.lyonsletters.com
Buy and rate your copy of Engagement is Not a Unicorn (It's a Narwhal),
From Amazon or Barnes & Noble






