Researchers develop LightSIDE program for grading essays

Credit: Adelaide Cole/Art Editor

Can computers identify good writing as well as humans? According to a recent study, they can, and some researchers at Carnegie Mellon had a hand in it.

Elijah Mayfield, a first-year Ph.D. student in the Language Technologies Institute (LTI) who is advised by LTI professor Carolyn Rosé, recently participated in a study that looked at how various essay-scoring software programs matched up against human graders. After grading 800 different essays, Mayfield’s software, which he developed with the help of six other LTI researchers, was found to perform just as well as the other programs in terms of assigning an overall score.

“They performed statistically indistinguishably from each other,” Mayfield said of how well the computer programs performed in relation to human graders. “They all were at or slightly above or slightly below human performance.”

These results come just a few years before most states plan to move their current pencil-based standardized tests to computer-based ones in 2014, according to National Public Radio.

The hope is that the switch will lessen the cost of administering these tests, as computers would take on more of the grading labor. For example, Ohio could save as much as 40 percent on state testing costs each year by having computer programs play a larger role in grading.

Organized by researchers at the University of Akron, the study compared the scores of 800 student-written essays graded by nine different computer programs, one of which was Mayfield’s open-source “LightSIDE” text mining program.

Most of the programs in the study learned to grade by example. “You need to give it training examples and say, ‘This is what I’m trying to learn to do,’ ” Mayfield explained. “It will take the text and figure out how to do all of that automatically.” In short, the programs would take essays that were graded by humans, and try to mimic the grading patterns on new essays. Mayfield explained that this is the goal of machine learning.
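The idea of learning to grade by example can be illustrated with a minimal sketch in Python. It is not LightSIDE's actual method, which the article does not detail. It assumes a deliberately simple technique: represent each essay as word counts and give a new essay the score of the most similar human-graded essay. The function names (`features`, `train`, `predict`) and the sample essays are hypothetical.

```python
import math
import re
from collections import Counter


def features(text):
    """Represent an essay as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(graded_essays):
    """'Training' here just stores the features of each human-graded essay."""
    return [(features(text), score) for text, score in graded_essays]


def predict(model, text, k=1):
    """Average the human scores of the k most similar training essays."""
    feats = features(text)
    ranked = sorted(model, key=lambda ex: cosine(feats, ex[0]), reverse=True)
    top = ranked[:k]
    return sum(score for _, score in top) / len(top)


# Hypothetical training examples: (essay text, human-assigned score).
graded = [
    ("The dog ran.", 1),
    ("Although the dog ran quickly, it stopped because the gate was closed.", 4),
]
model = train(graded)

# A new, ungraded essay that resembles the higher-scoring example.
new_score = predict(
    model, "The cat sat because the door was closed, although it was hungry."
)
```

A real system would train on hundreds of graded essays and use a stronger learning algorithm, but the shape is the same: human-scored examples go in, and a scoring function for unseen essays comes out.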

In assessing the quality of an essay, LightSIDE put a lot of weight on sentence structure.

If the program recognizes a lot of highly structured and complex sentences, then the text is likely to be of high quality. However, in order to spot the complex sentences, the program first has to recognize the various pieces, such as certain parts of speech, that may compose those types of sentences.

“You’ll see things like coordinating clauses or conjunctions, sentences that aren’t just simple statements of facts,” Mayfield said. “If you’re looking at things like prepositional phrases or adverbials, all these things don’t occur in those basic sentences.” Mayfield explained that LightSIDE searches for these types of words and sentence structures because these appear to be the elements that human graders concentrate on in the examples that the program learns from.
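As a rough illustration of what spotting these pieces might look like, the Python sketch below counts a few structural cue words in a sentence. It is an assumption-laden stand-in, not LightSIDE's implementation: a real system would likely use a part-of-speech tagger, whereas this sketch uses small hand-picked word lists, and the name `structure_features` is hypothetical.

```python
import re

# Small, hand-picked word lists (an assumption for illustration only).
# Word lookup cannot resolve ambiguity: "for" is counted as a conjunction
# and "to" as a preposition regardless of how they are actually used.
COORDINATING = {"and", "but", "or", "nor", "for", "so", "yet"}
SUBORDINATING = {"because", "although", "since", "while", "unless", "if", "when"}
PREPOSITIONS = {"in", "on", "at", "by", "with", "from", "to", "of", "over", "under"}


def structure_features(sentence):
    """Count simple cues of sentence complexity in one sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return {
        "coordinating": sum(w in COORDINATING for w in words),
        "subordinating": sum(w in SUBORDINATING for w in words),
        "prepositions": sum(w in PREPOSITIONS for w in words),
        "length": len(words),
    }


simple = structure_features("The dog ran.")
complex_ = structure_features(
    "The dog ran to the gate, but it stopped because the latch was closed."
)
```

Here the simple statement of fact produces zero for every cue, while the longer sentence registers one coordinating conjunction, one subordinating conjunction, and one preposition. Counts like these could then serve as inputs to a model trained on human-graded essays.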

“What the model has learned is that for human graders, whether they were told to or not, it’s those complex sentences and sentences that show a lot of structure and causality and coordination that they’re giving high scores to,” Mayfield said.

Some of the other vendors took a different approach during the essay-grading study. Their software looked for essay organization, such as how the thesis statement transitioned into the body paragraphs and how it flowed into the conclusion. Those methods worked in terms of assigning scores to essays that were similar to those given by human graders, but LightSIDE, despite its more general approach to assessing the text, did not lag behind in performance.

“Even if you use the simple, local structure of the sentence and the state-of-the-art machine learning, we can manage to match the performance of the vendors who are using all this complex rubric-based technology,” Mayfield said.

While the computer programs were able to assign a single score to an essay just as well as human graders could, the programs still lack the ability to recognize more abstract properties of an essay, such as creativity.

“As it gets to, ‘Does this person have literary worth? Is this person creative?’ those are questions that are much harder for machine learning to do,” Mayfield said. “I don’t think we’re at a point yet in that domain that we can match human graders.”