How ThinkingEngine Scores Student Reasoning

Overview

How the Scoring Works

Every time a student completes a dialogue session, ThinkingEngine generates a detailed reasoning score. The score isn't a grade. It's a map of where the student's thinking is strong and where it needs more support.

Here's the dashboard you see after students complete a session:

thinkingengine.org/teacher

Session Transcript: "Should cities ban plastic bags?"

AI: A city is considering banning plastic bags. Some stores say paper bags take more energy to produce. Does the ban actually help the environment?

You: Yes because plastic is bad for the ocean.

AI: You make a fair point about ocean pollution. But some people say paper bags aren't actually better; they take more water and energy to produce. Does that change your view?

You: Maybe, but I'd want to see actual data before deciding. Like, how much energy does each bag use? And what happens to plastic bags that aren't banned?

Reasoning Scores: Alex T.

Legend: Needs Support, Developing, Strong

Evidence: 8.8
Alternatives: 7.9
Logic: 7.2
Assumptions: 6.2
Depth: 8.1

Composite Score: 7.6

Each student's score is built from five dimensions of reasoning, with a composite score that represents the overall quality of their thinking.

The scoring process has four steps:

1. Dialogue: The AI runs a Socratic dialogue, asking open questions and adapting to the student's level.

2. Analysis: The AI reads the transcript for reasoning moves.

3. Scoring: Each dimension gets a score on a 0–10 scale.

4. Dashboard: The dashboard shows the scores to the teacher, who reviews them and plans follow-up.

The AI doesn't grade students. It maps their reasoning. You use that map to guide them.

Scoring Dimensions

The Five Scoring Dimensions

Every Socratic Dialogue session is scored across these five dimensions of reasoning. They were chosen because together they represent the core moves a skilled reasoner makes, regardless of the topic.

📋 Evidence Integration: Uses relevant facts, data, or examples to support claims, and acknowledges when evidence is missing.

🔄 Alternative Perspectives: Acknowledges other viewpoints, explains why someone might disagree, and weighs competing considerations.

🔗 Logical Consistency: Holds a position that follows from the evidence without internal contradictions or leaps.

💡 Assumption Identification: Recognizes when something is being taken for granted, names it explicitly, and considers whether it's justified.

🧭 Depth of Reasoning: Traces implications, considers second-order effects, and follows the logic of an argument to its conclusions.

When you review a student's score profile, you're not looking for a single number. You're looking at the shape of the profile. A student who scores 8.8 in Evidence but 5.2 in Alternatives is a different thinker than one who scores 7 across all five dimensions. Both can grow, but the coaching is different.
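One way to make "the shape of the profile" concrete is to look at the gap between a student's highest and lowest dimension alongside the average. The sketch below is illustrative only: `profile_spread` is a hypothetical helper, not a ThinkingEngine feature, and the Logic, Assumptions, and Depth values for the first student are invented so that both students land on the same average.

```python
def profile_spread(scores: dict[str, float]) -> float:
    """Gap between the highest and lowest dimension score, to one decimal."""
    return round(max(scores.values()) - min(scores.values()), 1)


def average(scores: dict[str, float]) -> float:
    """Simple average of the dimension scores, to one decimal."""
    return round(sum(scores.values()) / len(scores), 1)


# Hypothetical profiles. Only the 8.8 and 5.2 come from the example above;
# the remaining values are made up for illustration.
spiky = {"Evidence": 8.8, "Alternatives": 5.2, "Logic": 7.0,
         "Assumptions": 7.0, "Depth": 7.0}
flat = {"Evidence": 7.0, "Alternatives": 7.0, "Logic": 7.0,
        "Assumptions": 7.0, "Depth": 7.0}

for name, profile in (("spiky", spiky), ("flat", flat)):
    print(name, "average:", average(profile), "spread:", profile_spread(profile))
```

Both profiles average 7.0, but the first has a spread of 3.6 and the second a spread of 0.0, which is why the coaching conversation differs even when the summary number does not.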

Score Interpretation

Understanding Your Scores: Three Tiers

Each dimension is scored on a 0–10 scale. The scores aren't letter grades. They're diagnostic markers that tell you what kind of support a student needs next. Here's how to read them:

0–3.9: Needs Support
4–6.9: Developing
7–10: Strong

The tier label describes the current reasoning level, not the student's potential or effort.

Needs Support (0–3.9): The student is reasoning at a surface level. They might be jumping to conclusions, missing counter-arguments, or relying on intuition rather than evidence. This doesn't mean they're not capable. It means the session surfaced the gaps clearly.

Developing (4–6.9): The student is doing real reasoning but inconsistently. They might have good evidence but miss the alternatives, or see the assumptions but not follow the logic through. This is where most students live most of the time, and where the most growth happens.

Strong (7–10): The student is reasoning at a genuinely sophisticated level. They're integrating evidence, considering alternatives, and following the logic of their position. A high score doesn't mean they're done. It means they're ready for harder questions.

Your lowest dimension score is often the most useful piece of information. It tells you which reasoning move to work on next, either in a follow-up session or in your next whole-class discussion.
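The tier ranges and the "lowest dimension" rule can be sketched in a few lines of Python. This is a minimal illustration, not ThinkingEngine's code: the function names are hypothetical, and it assumes scores are reported to one decimal place, so the boundaries can be written as "below 4" and "below 7". The scores are Alex T.'s from the dashboard example.

```python
def tier(score: float) -> str:
    """Map a 0-10 dimension score to its tier label."""
    if not 0 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    if score < 4:
        return "Needs Support"   # 0-3.9
    if score < 7:
        return "Developing"      # 4-6.9
    return "Strong"              # 7-10


def weakest_dimension(scores: dict[str, float]) -> str:
    """Return the name of the lowest-scoring dimension."""
    return min(scores, key=scores.get)


alex = {"Evidence": 8.8, "Alternatives": 7.9, "Logic": 7.2,
        "Assumptions": 6.2, "Depth": 8.1}

for dimension, score in alex.items():
    print(f"{dimension}: {score} ({tier(score)})")
print("Work on next:", weakest_dimension(alex))
```

For Alex, four dimensions come out as Strong and Assumptions (6.2) comes out as Developing, so Assumptions is the dimension to work on next.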

Composite score: The overall score shown on the dashboard is the simple average of the five dimension scores, rounded to one decimal place. It gives you a quick summary, but the dimension breakdown is where the real coaching happens.
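As a worked example, Alex T.'s dashboard scores give (8.8 + 7.9 + 7.2 + 6.2 + 8.1) / 5 = 38.2 / 5 = 7.64, which is shown as 7.6. The same arithmetic as a short Python sketch follows; `composite_score` is a hypothetical name, and how ThinkingEngine breaks exact rounding ties is not stated here, so Python's built-in `round` is only an assumption.

```python
def composite_score(scores: dict[str, float]) -> float:
    """Simple average of the dimension scores, rounded to one decimal place."""
    return round(sum(scores.values()) / len(scores), 1)


alex = {"Evidence": 8.8, "Alternatives": 7.9, "Logic": 7.2,
        "Assumptions": 6.2, "Depth": 8.1}

print(composite_score(alex))  # 7.6
```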

Session Types

Socratic Dialogue vs. Inquiry-Based Exploration

ThinkingEngine has two session types. Their rubrics differ because they're measuring different kinds of reasoning.

Socratic Dialogue (6 dimensions)

1. Evidence Integration
2. Alternative Perspectives
3. Logical Consistency
4. Assumption Identification
5. Depth of Reasoning
6. Argument Parsimony

Inquiry-Based Exploration (5 dimensions)

1. Observation Coverage
2. Hypothesis Breadth
3. Evaluative Rigor
4. Defense Quality
5. Assumption Identification

Socratic Dialogue scores six dimensions. The sixth, Argument Parsimony, measures whether a student uses the simplest reasoning necessary to support their claim, rather than piling on unnecessary complexity. In philosophy and ethics discussions especially, this matters: the student who makes the tightest argument is often the one who understands the problem most clearly.

Inquiry-Based Exploration uses five dimensions focused on scientific and investigative reasoning, specifically how well students gather evidence, generate hypotheses, evaluate those hypotheses against data, and defend their conclusions.

Both session types score Assumption Identification. It's the one dimension that appears in both rubrics because the ability to name what's being taken for granted is foundational to reasoning in every domain.
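The overlap between the two rubrics can be checked by writing them out as lists and intersecting them. The Python sketch below uses the dimension names from the lists above; the variable names are illustrative and not part of any ThinkingEngine export format.

```python
SOCRATIC_DIALOGUE = [
    "Evidence Integration",
    "Alternative Perspectives",
    "Logical Consistency",
    "Assumption Identification",
    "Depth of Reasoning",
    "Argument Parsimony",
]

INQUIRY_BASED_EXPLORATION = [
    "Observation Coverage",
    "Hypothesis Breadth",
    "Evaluative Rigor",
    "Defense Quality",
    "Assumption Identification",
]

# Dimensions that appear in both rubrics.
shared = set(SOCRATIC_DIALOGUE) & set(INQUIRY_BASED_EXPLORATION)
print(shared)  # {'Assumption Identification'}
```

In practice this means Assumption Identification is the only dimension you can compare directly when a student moves between session types.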

Reading Scores

What to Look For: A Response Ladder

To understand what scores actually look like in practice, it helps to see each dimension across five levels of student reasoning. Here are all five dimensions, starting with Evidence Integration, in a discussion about whether the city should ban plastic bags:

Evidence Integration: Five Response Levels

Level 1, Needs Support (1.4): "Plastic bags should be banned because they're bad."

Level 2, Developing (4.2): "Plastic bags are bad for the environment, so banning them helps."

Level 3, Developing (6.4): "Studies show that plastic waste is a major source of ocean pollution, which banning bags could reduce."

Level 4, Strong (8.1): "Research indicates plastic makes up roughly 80% of ocean debris, but paper production has higher carbon emissions, so the environmental benefit isn't clear-cut."

Level 5, Strong (9.7): "Ocean plastic data is robust (80%+ of debris), but lifecycle analyses of paper vs. plastic bags show paper has 3x the global warming potential per bag, yet plastic's end-of-life costs are severe in marine environments. The tradeoff depends on which environmental harm is prioritized."

Alternative Perspectives: Five Response Levels

Level 1, Needs Support (1.3): "Plastic bags should be banned. They're terrible for the environment and that's that."

Level 2, Developing (4.0): "Some people might not want plastic bags banned, I guess, but they should think about the environment."

Level 3, Developing (6.5): "Store owners would probably disagree because they say plastic bags are cheaper and customers like them. They might say a ban would hurt small businesses."

Level 4, Strong (8.2): "People who oppose the ban might argue that the real problem isn't plastic bags but littering behavior, and that targeting bags specifically doesn't solve the underlying issue. They'd point out that reusable bags require manufacturing too, and that some low-income households rely on cheap plastic bags for multiple uses around the home."

Level 5, Strong (9.6): "The strongest counter-argument isn't just about cost; it's about distributional impact. People who oppose bag bans often argue that reusables carry hidden environmental costs that disproportionately affect lower-income households. Meanwhile, supporters emphasize the cumulative plastic problem. What's really at stake isn't a factual disagreement but a values conflict: do we prioritize marine ecosystem health or household-level affordability?"

Logical Consistency: Five Response Levels

Level 1, Needs Support (1.2): "Plastic bags are bad. We should ban them. Done."

Level 2, Needs Support (3.9): "Plastic bags cause pollution, and pollution is bad, so we should ban plastic bags."

Level 3, Developing (6.3): "Plastic bags are used once and thrown away. Most people don't recycle them. They end up in the ocean. So banning them would keep them out of the ocean."

Level 4, Strong (8.0): "Plastic bags are used once and discarded by most consumers. Most discarded bags are not recycled. A meaningful percentage of litter enters waterways and eventually the ocean. Therefore, reducing bag distribution would reduce marine plastic, assuming consumers switch to alternatives rather than simply substituting another product."

Level 5, Strong (9.6): "If (1) most consumers use bags only once and discard them, and (2) most discarded bags enter the litter stream, and (3) a measurable fraction of litter enters marine environments, then (4) reducing bag distribution would reduce marine plastic load, conditional on (5) the substitution effect. Premise (5) is where the argument is most vulnerable. Studies show substitution partially offsets the marine benefit, but not entirely. So the conclusion: ban likely helps, but not as much as the raw numbers suggest."

Assumption Identification: Five Response Levels

Level 1, Needs Support (1.3): "Plastic bags cause pollution, so we should ban them. That's just logical."

Level 2, Needs Support (3.8): "Banning plastic bags would reduce pollution. People would use reusable bags instead."

Level 3, Developing (6.4): "Banning plastic bags would reduce pollution, but only if people actually switch to reusable bags. I'm assuming they will, but maybe they won't."

Level 4, Strong (8.1): "My argument assumes that banning plastic bags would cause consumers to switch to reusable alternatives. But evidence from cities that implemented bag bans suggests significant substitution with paper bags, which have their own environmental costs. If consumers substitute, the net environmental benefit depends on the comparative lifecycle impact. If the substitution assumption fails, my conclusion weakens considerably."

Level 5, Strong (9.7): "My argument rests on three unstated assumptions. First: reduced bag distribution translates to reduced harm, but this assumes the substitute is meaningfully better. Second: consumer behavior is responsive to policy, but this assumes enforcement and compliance. Third: environmental benefit is the relevant metric, but this assumes the ban doesn't impose disproportionate costs on lower-income households. If any assumption fails significantly, the argument weakens. If two fail simultaneously, the argument inverts."

Depth of Reasoning: Five Response Levels

Level 1, Needs Support (1.2): "Plastic bags should be banned because they're bad for the environment."

Level 2, Developing (4.1): "If plastic bags are banned, then there will be fewer plastic bags in the ocean."

Level 3, Developing (6.5): "If we ban plastic bags, retailers will need to switch to paper or reusable bags. That could mean higher costs for stores, which might get passed to consumers."

Level 4, Strong (8.2): "If banning plastic bags does reduce marine plastic, we'd expect coastal ecosystems to improve over time, supporting fisheries and tourism. Conversely, if the ban is ineffective because consumers substitute paper bags (which have higher carbon footprints) or simply use thicker plastic bags as a workaround, we pay the economic cost without the environmental gain. The key conditional is substitution behavior: the policy only delivers if consumers don't simply substitute."

Level 5, Strong (9.6): "If bag bans effectively reduce marine plastic, a broader suite of single-use product policies should similarly reduce ocean waste, suggesting the policy is a model, not just an isolated intervention. However, this depends on three upstream conditions: (1) the ban must change consumer behavior, (2) the substitute must not have equal or greater environmental cost, and (3) enforcement must be sufficient. If condition (1) fails, the policy is locally ineffective; if (2) fails, it is globally counterproductive; if (3) fails, it is inequitable. Conversely, if bans don't meaningfully reduce marine plastic, because the dominant pollution sources are industrial, not consumer, then the entire policy rationale collapses."

Take the Evidence Integration ladder as an example. The jump from level 1 to level 2 is about adding a qualifier ("bad for the environment" instead of just "bad"). The jump from level 2 to level 3 is about pointing to evidence ("Studies show...") and tying it to the claim. The jump from level 3 to level 4 is about acknowledging the counter-evidence. The jump from level 4 to level 5 is about naming the specific values in tension (which environmental harm is being prioritized) and making the reasoning framework explicit.

What you don't see at the top of these ladders: appeals to authority ("experts say...") without specifics, or evidence cited without connection to the claim. Those score lower even if the underlying point is right.

Coaching

What to Do After You See the Scores

The score tells you where the gap is. What you say to close it is the teaching. Here are diagnostic questions mapped to the five dimensions. Use them in one-on-one conferences, small groups, or whole-class discussions:

Dimension 1, Evidence Integration: "What's the strongest piece of evidence you have for this position? What would change your view?"

Dimension 2, Alternative Perspectives: "Who would disagree with this? What would their best argument be?"

Dimension 3, Logical Consistency: "Walk me through your reasoning step by step. Is there a place where the logic skips?"

Dimension 4, Assumption Identification: "What are you assuming to be true that you haven't proven yet?"

Dimension 5, Depth of Reasoning: "If you're right, what else follows? If you're wrong, what else falls apart?"

Bonus, whole-class pattern: "Based on what we're seeing across the class, what does good reasoning actually look like here? What does it sound like in action?"

Use the dimension-specific questions in individual conferences. Use the whole-class question when you notice a pattern across several students. It turns a score into a lesson.

The scores are not for ranking students. They're for identifying where each student needs to go next. A score of 4.2 on Alternatives is not a failure. It's a precise target for your next conversation with that student.
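If you keep scores in a spreadsheet or script, pairing each student's lowest dimension with its diagnostic question is a simple lookup. The Python sketch below is one possible way to do it; `next_coaching_question` is a hypothetical helper, and the student's scores are invented apart from the 4.2 on Alternatives mentioned above.

```python
COACHING_QUESTIONS = {
    "Evidence Integration": "What's the strongest piece of evidence you have "
                            "for this position? What would change your view?",
    "Alternative Perspectives": "Who would disagree with this? "
                                "What would their best argument be?",
    "Logical Consistency": "Walk me through your reasoning step by step. "
                           "Is there a place where the logic skips?",
    "Assumption Identification": "What are you assuming to be true that you "
                                 "haven't proven yet?",
    "Depth of Reasoning": "If you're right, what else follows? "
                          "If you're wrong, what else falls apart?",
}


def next_coaching_question(scores: dict[str, float]) -> tuple[str, str]:
    """Return the lowest-scoring dimension and its diagnostic question."""
    focus = min(scores, key=scores.get)
    return focus, COACHING_QUESTIONS[focus]


# Hypothetical student; only the 4.2 comes from the text above.
student = {
    "Evidence Integration": 6.8,
    "Alternative Perspectives": 4.2,
    "Logical Consistency": 6.1,
    "Assumption Identification": 5.5,
    "Depth of Reasoning": 6.0,
}

focus, question = next_coaching_question(student)
print(f"Focus: {focus}")
print(f"Ask: {question}")
```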

Putting It Together

A Suggested Classroom Routine

Here's one way to integrate ThinkingEngine into your week without adding hours of prep. This is a starting point. Adapt it to your content and pacing:

Weekly ThinkingEngine Workflow

1. Monday: Assign a topic. Students complete one session at home or in class.

2. Review Scores: Scan the score distribution. Note the lowest dimension. That's your focus for the week.

3. Wednesday: Small group session. Students discuss what they argued and why.

4. Friday: Lead a whole-class discussion using the diagnostic question for the week's focus dimension.

5. Next Week: Run the same topic again. Compare dimension scores. The improvement is the point.

The second session on the same topic is where the real learning happens. Students who got a 5.2 on Alternatives the first week often get a 7.4 the second week. Not because the topic was easier, but because they now know what to look for. The AI doesn't change its questions; the student changes their reasoning.
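Comparing the two weeks comes down to subtracting each dimension's first score from its second. The sketch below shows one way to do that in Python; `dimension_changes` is a hypothetical helper, and every score is invented for illustration except the 5.2 and 7.4 on Alternatives quoted above.

```python
def dimension_changes(first: dict[str, float],
                      second: dict[str, float]) -> dict[str, float]:
    """Per-dimension change from the first session to the second."""
    return {dim: round(second[dim] - first[dim], 1) for dim in first}


# Hypothetical scores for one student on the same topic, one week apart.
week_1 = {"Evidence": 7.0, "Alternatives": 5.2, "Logic": 6.5,
          "Assumptions": 6.0, "Depth": 6.5}
week_2 = {"Evidence": 7.5, "Alternatives": 7.4, "Logic": 6.5,
          "Assumptions": 6.5, "Depth": 7.0}

changes = dimension_changes(week_1, week_2)
biggest_gain = max(changes, key=changes.get)

for dimension, delta in changes.items():
    print(f"{dimension}: {delta:+.1f}")
print("Biggest gain:", biggest_gain)
```

In this made-up case the largest gain is on Alternatives (+2.2), the dimension the week's discussion focused on.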

You're not building sessions. You're building a reasoning habit.