Here’s a little pop quiz.
Multiple-choice tests are useful because:
A: They’re cheap to score.
B: They can be scored quickly.
C: They score without human bias.
D: All of the above.
It would take a computer about a nanosecond to mark “D” as the correct answer. That’s easy.
But now, machines are also grading students’ essays. Computers are scoring long-form answers on anything from the fall of the Roman Empire to the pros and cons of government regulations.
Developers of so-called “robo-graders” say they understand why many students and teachers would be skeptical of the idea. But they insist that, with computers already doing jobs as complicated and as fraught as driving cars, detecting cancer, and carrying on conversations, machines can certainly handle grading students’ essays.
“I’ve been working on this now for about 25 years, and I feel that … the time is right and it’s really starting to be used now,” says Peter Foltz, a research professor at the University of Colorado, Boulder. He’s also vice president for research for Pearson, the company whose automated scoring program graded some 34 million student essays on state and national high-stakes tests last year. “There will always be people who don’t trust it … but we’re seeing a lot more breakthroughs in areas like content understanding, and AI is now able to do things which they couldn’t do really well before.”
Foltz says computers “learn” what’s considered good writing by analyzing essays graded by humans. Then, the automated programs score essays themselves by scanning for those same features.
“We have artificial intelligence techniques which can judge anywhere from 50 to 100 features,” Foltz says. That includes not only basics like spelling and grammar, but also whether a student is on topic, the coherence or the flow of an argument, and the complexity of word choice and sentence structure. “We’ve done a number of studies to show that the scoring can be highly accurate,” he says.
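For a rough sense of what “scanning for features” can mean, here is a minimal sketch in Python. It is not Pearson’s system: the four features, the list of transition phrases, and the idea of combining them with a simple weighted sum are all assumptions made for illustration. A real scorer would use far more features, with weights learned from human-graded essays.

```python
import re

# Hypothetical transition phrases; an assumption for this sketch only.
TRANSITIONS = ("in conclusion", "firstly", "secondly", "thirdly", "however")


def extract_features(essay: str) -> dict:
    """Return a few surface features of an essay (illustrative only)."""
    words = re.findall(r"[A-Za-z']+", essay)
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    lowered = essay.lower()
    return {
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "transition_count": sum(lowered.count(t) for t in TRANSITIONS),
    }


def score(features: dict, weights: dict, bias: float = 0.0) -> float:
    """Combine features with weights assumed to come from human-graded essays."""
    return bias + sum(weights[name] * features[name] for name in weights)


if __name__ == "__main__":
    sample = "In conclusion, cats are great. Firstly, they purr."
    print(extract_features(sample))
```

Nothing in this sketch checks whether a sentence is true or even meaningful. It only counts.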
To demonstrate, he takes a not-so-stellar sample essay, rife with spelling mistakes and sentence fragments, and runs it by the robo-grader, which instantly spits back a not-so-stellar score.
“It gives an overall score of two out of four,” Foltz explains. The computer also breaks it down in several categories of sub-scores showing, for example, a one on spelling and grammar, and a two on task and focus.
Several states, including Utah and Ohio, already use automated grading on their standardized tests. Cyndee Carter, assessment development coordinator for the Utah State Board of Education, says the state began very cautiously, at first making sure every machine-graded essay was also read by a real person. But she says the computer scoring has proven “spot-on” and Utah now lets machines be the sole judge of the vast majority of essays. In about 20 percent of cases, she says, when the computer detects something unusual, or is on the fence between two scores, it flags an essay for human review. But all in all, she says the automated scoring system has been a boon for the state, not only for the cost savings, but also because it enables teachers to get test results back in minutes rather than months.
Massachusetts is among those now intrigued by the possibilities, and is considering jumping on the bandwagon to have computers score essays on its statewide Massachusetts Comprehensive Assessment System (MCAS) tests.
Commissioner of Elementary and Secondary Education Jeffrey C. Riley called the prospect “exciting” at a recent Board of Elementary and Secondary Education meeting outlining plans to look into the idea. “I’m suspending belief that this is possible,” he said.
Department of Education Deputy Commissioner Jeff Wulfson cited “huge advances in artificial intelligence in the last few years” and cracked, “I asked Alexa whether she thought we’d ever be able to use computers to reliably score tests, and she said absolutely.”
But many teachers are unconvinced.
“The idea is bananas, as far as I’m concerned,” says Kelly Henderson, an English teacher at Newton South High School just outside Boston. “An art form, a form of expression being evaluated by an algorithm is patently ridiculous.”
Another English teacher, Robyn Marder, nods her head in agreement. “What about original ideas? Where is room for creativity of expression? A computer is going to miss all of that,” she says.
Marder and Henderson worry robo-graders will just encourage the worst kind of formulaic writing.
“What is the computer program going to reward?” Henderson challenges. “Is it going to reward some vapid drivel that happens to be structurally sound?”
Turns out that’s an easy question to answer, thanks to Les Perelman, an MIT research affiliate and longtime critic of automated scoring. He’s designed what you might think of as robo-graders’ kryptonite, to expose what he sees as the weakness and absurdity of automated scoring. Called the Babel (“Basic Automatic B.S. Essay Language”) Generator, it works like a computerized Mad Libs, creating essays that make zero sense, but earn top scores from robo-graders.
To demonstrate, he calls up a practice question for the GRE exam that’s graded with the same algorithms that actual tests are. He then enters three words related to the essay prompt into his Babel Generator, which instantly spits back a 500-word wonder, replete with a plethora of obscure multisyllabic synonyms:
“History by mimic has not, and presumably never will be precipitously but blithely ensconced. Society will always encompass imaginativeness; many of scrutinizations but a few for an amanuensis. The perjured imaginativeness lies in the area of theory of knowledge but also the field of literature. Instead of enthralling the analysis, grounds constitutes both a disparaging quip and a diligent explanation.”
“It makes absolutely no sense,” he says, shaking his head. “There is no meaning. It’s not real writing.”
But Perelman promises that won’t matter to the robo-grader. And sure enough, when he submits it to the GRE automated scoring system, it gets a perfect score: 6 out of 6, which according to the GRE, means it “presents a cogent, well-articulated analysis of the issue and conveys meaning skillfully.”
“It’s so scary that it works,” Perelman sighs. “Machines are very brilliant for certain things and very stupid on other things. This is a case where the machines are very, very stupid.”
Because computers can only count, and cannot actually understand meaning, he says, facts are irrelevant to the algorithm. “So you can write that the War of 1812 began in 1945, and that wouldn’t count against you at all,” he says. “In fact it would count for you because [the computer would consider it to be] good detail.”
Perelman says his Babel Generator also proves how easy it is to game the system. While students are not going to walk into a standardized test with a Babel Generator in their back pocket, he says, they will quickly learn they can fool the algorithm by using lots of big words, complex sentences, and some key phrases that make some English teachers cringe.
“For example, you will get a higher score just by [writing] ‘in conclusion,’” he says.
Gaming the system?
But Nitin Madnani, senior research scientist at Educational Testing Service (ETS), the company that makes the GRE’s automated scoring program, says that’s not exactly a hack.
“If someone is smart enough to pay attention to all the things that an automated system pays attention to, and to incorporate them in their writing, that’s no longer gaming, that’s good writing,” he says. “So you kind of do want to give them a good grade.”
GRE essays are still always scored by a human reader as well as a computer, Madnani says. So pure babble would never pass a real test.
But in places like Utah, where tests are graded by machines only, scampish students are giving the algorithm a run for its money.
“Students are genius, and they’re able to game the system,” notes Carter, the assessment official from Utah.
One year, she says, a student who wrote a whole page of the letter “b” ended up with a good score. Other students have figured out that they could do well writing one really good paragraph and just copying that four times to make a five-paragraph essay that scores well. Others have pulled one over on the computer by padding their essays with long quotes from the text they’re supposed to analyze, or from the question they’re supposed to answer.
But each time, Carter says, the computer code is tweaked to spot those tricks.
“We think we’re catching most things now,” Carter says, but students are “very creative” and the computer programs are continually being updated to flag different kinds of ruses.
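To make those “tweaks” concrete, here is a hedged sketch in Python of checks for the three tricks Carter describes: a page of one letter, a paragraph pasted several times, and long passages lifted from the prompt or source text. It is not Utah’s or any vendor’s code. The function name `gaming_flags`, the five-word window, and the 50 percent overlap threshold are all assumptions chosen for illustration.

```python
import re


def _words(text: str) -> list:
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z']+", text.lower())


def gaming_flags(essay: str, prompt: str, overlap_threshold: float = 0.5) -> list:
    """Return names of suspicious patterns found (thresholds are assumptions)."""
    flags = []

    # Trick 1: the same paragraph pasted more than once.
    paragraphs = [" ".join(_words(p)) for p in essay.split("\n\n") if p.strip()]
    if len(set(paragraphs)) < len(paragraphs):
        flags.append("repeated_paragraph")

    # Trick 2: an essay built from a single repeated character.
    words = _words(essay)
    if words and len(set("".join(words))) == 1:
        flags.append("single_letter_filler")

    # Trick 3: share of the essay's 5-word sequences that also occur in the prompt.
    n = 5
    essay_grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    prompt_words = _words(prompt)
    prompt_grams = {tuple(prompt_words[i:i + n])
                    for i in range(len(prompt_words) - n + 1)}
    if essay_grams:
        copied = sum(g in prompt_grams for g in essay_grams) / len(essay_grams)
        if copied >= overlap_threshold:
            flags.append("copied_from_prompt")

    return flags
```

Each check catches only the ruse it was written for, which is why the list keeps growing as students invent new ones.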
“In this game of cat and mouse, the vendors have already identified [these] strategy[ies],” says David Shermis, dean and professor at the School of Education at the University of Houston, Clear Lake, who’s an expert in automated scoring. As a safeguard, all essays get not only a score, but also a “confidence” rating. “So those essays will be scored with ‘low confidence,’ and [the computer] will say ‘please have a human have a look at this,’” he says.
Critics of robo-grading also worry it will change the way teachers teach. “If teachers are being evaluated on how well their students perform on [standardized tests that are machine-graded] and schools are being evaluated on how well they test, then teachers are going to be teaching to the test,” says Perelman. “And teachers will be teaching students to produce the wrong thing.”
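Shermis doesn’t say how vendors compute that confidence rating, so the following Python sketch rests on one simple assumption: the model produces a fractional score, and confidence drops as that number nears the midpoint between two whole scores, the “on the fence” case. The 1-to-4 scale, the function name, and the 0.5 cutoff are illustrative choices, not details from any real system.

```python
def score_with_confidence(raw_score: float, low: int = 1, high: int = 4,
                          min_confidence: float = 0.5) -> dict:
    """Round a fractional score and flag it when it sits near a boundary.

    Confidence is 1.0 when raw_score is exactly a whole number and 0.0 when
    it falls exactly halfway between two scores (an assumed definition).
    """
    clamped = min(max(raw_score, low), high)   # keep within the score scale
    final = int(clamped + 0.5)                 # round half up
    distance = abs(clamped - final)            # between 0.0 and 0.5
    confidence = 1.0 - 2.0 * distance
    return {
        "score": final,
        "confidence": confidence,
        "needs_human": confidence < min_confidence,
    }


if __name__ == "__main__":
    for raw in (2.1, 2.4, 3.0):
        print(raw, score_with_confidence(raw))
```

Under these assumptions a raw 2.1 is scored as a confident 2, while a 2.4 also rounds to 2 but is sent to a human reader.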
‘The facts are secondary’
Indeed, being a good writer is not the same thing as being a “higher-scoring GRE essay writer,” says Orion Taraban, executive director of Stellar GRE, a tutoring company in San Francisco.
“Students really need to appreciate that they’re writing for a machine … [and when students] agonize over crafting beautiful, wonderfully logically coherent and empirically validated paragraphs, it’s like pearls before swine. The computer can’t appreciate what this person has done and they don’t get the score that they deserve.”
Instead, Taraban tutors students to give the computer what it wants. “I train them in fabricating evidence and fabricating fake studies, which is a lot of fun,” he says, quickly adding, “but I also tell them not to do this in real life.”
For example, when writing a persuasive essay, Taraban advises students to use a basic formula and get creative. It goes something like this:
“A [pick any year] study by Professor [fill in any old name] at the [insert your favorite university] in which the authors analyze [summarize the crux of the debate here], researchers discovered that [insert compelling data here] … and that [offer more invented, persuasive evidence here.] This demonstrates that [go to town boosting your thesis here!]”
His students do this all the time, using the name of say, their roommate, and citing that fake expert’s fake research to bolster their argument. More often than not, they’ve been rewarded with great scores.
“Yeah, we see a lot of that,” concedes Madnani, who works for ETS on the GRE automated scoring program. “But it’s not the end of the world.” Even human readers, who may have two minutes to read each essay, would not take the time to fact-check those kinds of details, he says. “But if the goal of the assessment is to test whether you are a good English writer, then the facts are secondary.”
It’s a different story on achievement tests that are meant to test a student’s mastery of history, for example. In those cases it would matter if a student writes that the War of 1812 began in 1945. AI systems can check facts against a database, he says, but that only works on very narrow questions. “If you might have millions of facts that could come in, there’s no possible way any automated system could verify all of them,” he says. “So that’s why we have humans in the loop.”
Ultimately, he says, computer programs are doing what they were designed to do: to assess whether a student knows how to construct an essay, with a thesis, evidence, and a conclusion, all in good English. It’s true that a transitional phrase like “in conclusion” signals to the algorithm that you’ve got one, just as “firstly,” “secondly,” and “thirdly” broadcast that a student is moving through a multifaceted argument. Purists may turn their noses up at that kind of formulaic writing, but as developers note, the computers learn what good writing is from teachers, and just mirror that. “Only if teachers think that writing ‘in conclusion’ is a good structure to use, then students will tend to be rewarded for it,” says Foltz, from Pearson.
So, in conclusion, robo-grading technology may indeed be “demonstrating proficiency” and “learning new skills.” But, experts say, it’s also still got plenty of “room for improvement.”