For Teachers

How ThinkingEngine Scores Student Reasoning

A guide to understanding your students' reasoning scores — what they measure, how they break down, and how to use them in your classroom.

How the Scoring Works

Every time a student completes a dialogue session, ThinkingEngine generates a detailed reasoning score. The score isn't a grade — it's a map of where the student's thinking is strong and where it needs more support.

Here's the dashboard you see after students complete a session:

Dashboard example (thinkingengine.org/teacher)

Session Transcript: "Should cities ban plastic bags?"
AI: A city is considering banning plastic bags. Some stores say paper bags take more energy to produce. Does the ban actually help the environment?
You: Yes because plastic is bad for the ocean.
AI: You make a fair point about ocean pollution. But some people say paper bags aren't actually better, since they take more water and energy to produce. Does that change your view?
You: Maybe, but I'd want to see actual data before deciding. Like, how much energy does each bag use? And what happens to plastic bags that aren't banned?

Reasoning Scores for Alex T. (legend: Needs Support, Developing, Strong)
Evidence: 8.8
Alternatives: 7.9
Logic: 7.2
Assumptions: 6.2
Depth: 8.1
Composite Score: 7.6

Each student's score is built from five dimensions of reasoning, with a composite score that represents the overall quality of their thinking.

The scoring process has four steps:

1. Dialogue: The AI runs a Socratic dialogue, asking open questions and adapting to the student's level.
2. Analysis: The AI reads the transcript for reasoning moves.
3. Scoring: Each dimension gets a score on a 0–10 scale.
4. Dashboard: The dashboard shows the scores to the teacher, who reviews them and plans follow-up.
The AI doesn't grade students — it maps their reasoning. You use that map to guide them.

The Five Scoring Dimensions

The dashboard above scores five dimensions of reasoning. They were chosen because together they represent the core moves a skilled reasoner makes, regardless of the topic. The rubric varies somewhat by session type, as described later in this guide.

Evidence Integration: Uses relevant facts, data, or examples to support claims, and acknowledges when evidence is missing.
Alternative Perspectives: Acknowledges other viewpoints, explains why someone might disagree, and weighs competing considerations.
Logical Consistency: Holds a position that follows from the evidence without internal contradictions or leaps.
Assumption Identification: Recognizes when something is being taken for granted, names it explicitly, and considers whether it's justified.
Depth of Reasoning: Traces implications, considers second-order effects, and follows the logic of an argument to its conclusions.

When you review a student's score profile, you're not looking for a single number — you're looking at the shape of the profile. A student who scores 8.8 in Evidence but 5.2 in Alternatives is a different thinker than one who scores 7 across all five dimensions. Both can grow, but the coaching is different.


Understanding Your Scores: Three Tiers

Each dimension is scored on a 0–10 scale. The scores aren't letter grades — they're diagnostic markers that tell you what kind of support a student needs next. Here's how to read them:

0–3.9: Needs Support
4–6.9: Developing
7–10: Strong
The tier label describes the current reasoning level — not the student's potential or effort.

Needs Support (0–3.9): The student is reasoning at a surface level. They might be jumping to conclusions, missing counter-arguments, or relying on intuition rather than evidence. This doesn't mean they're not capable — it means the session surfaced the gaps clearly.

Developing (4–6.9): The student is doing real reasoning but inconsistently. They might have good evidence but miss the alternatives, or see the assumptions but not follow the logic through. This is where most students live most of the time, and where the most growth happens.

Strong (7–10): The student is reasoning at a genuinely sophisticated level. They're integrating evidence, considering alternatives, and following the logic of their position. A high score doesn't mean they're done — it means they're ready for harder questions.
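
If you export scores to a spreadsheet or script, the tier bands above can be applied mechanically. The following is a minimal Python sketch, not ThinkingEngine's own code: the function name `tier_for` is hypothetical, and it assumes scores are reported to one decimal place, so nothing falls between 3.9 and 4 or between 6.9 and 7.

```python
def tier_for(score: float) -> str:
    """Map a 0-10 dimension score to the tier label used in this guide.

    Assumption: scores come with one decimal place, so "below 4" is the
    same as the guide's 0-3.9 band, and "below 7" the same as 4-6.9.
    """
    if not 0 <= score <= 10:
        raise ValueError(f"score must be between 0 and 10, got {score}")
    if score < 4:
        return "Needs Support"
    if score < 7:
        return "Developing"
    return "Strong"


# Alex T.'s dimension scores from the dashboard example in this guide.
alex = {"Evidence": 8.8, "Alternatives": 7.9, "Logic": 7.2,
        "Assumptions": 6.2, "Depth": 8.1}

tiers = {dimension: tier_for(score) for dimension, score in alex.items()}
for dimension, tier in tiers.items():
    print(f"{dimension}: {alex[dimension]} ({tier})")
```

For Alex T., four dimensions land in Strong and Assumptions (6.2) lands in Developing.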

Your lowest dimension score is often the most useful piece of information. It tells you which reasoning move to work on next — either in a follow-up session or in your next whole-class discussion.
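
Finding the lowest dimension is a one-line lookup once the scores are in a table. This short Python sketch uses Alex T.'s scores from the dashboard example; the variable names are illustrative only, and if two dimensions tie, it simply returns whichever is listed first.

```python
# Alex T.'s dimension scores from the dashboard example in this guide.
alex = {"Evidence": 8.8, "Alternatives": 7.9, "Logic": 7.2,
        "Assumptions": 6.2, "Depth": 8.1}

# The dimension with the smallest score is the suggested coaching focus.
focus_dimension = min(alex, key=alex.get)
print(focus_dimension, alex[focus_dimension])  # Assumptions 6.2
```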

Composite score: The overall score shown on the dashboard is the simple average of the five dimension scores, rounded to one decimal place. It gives you a quick summary, but the dimension breakdown is where the real coaching happens.
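
As a worked example of that average, here is a small Python sketch using Alex T.'s scores from the dashboard example. The function name `composite` is hypothetical, and how ThinkingEngine rounds exact ties (for example, 7.65) is not stated in this guide, so Python's built-in `round` is an assumption.

```python
def composite(scores: dict) -> float:
    """Simple average of the dimension scores, rounded to one decimal place."""
    return round(sum(scores.values()) / len(scores), 1)


# Alex T.'s dimension scores from the dashboard example in this guide.
alex = {"Evidence": 8.8, "Alternatives": 7.9, "Logic": 7.2,
        "Assumptions": 6.2, "Depth": 8.1}

# (8.8 + 7.9 + 7.2 + 6.2 + 8.1) / 5 = 38.2 / 5 = 7.64, which rounds to 7.6
print(composite(alex))  # 7.6
```

The result matches the Composite Score of 7.6 on the example dashboard.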


Socratic Dialogue vs. Inquiry-Based Exploration

ThinkingEngine has two session types. They use different scoring dimensions because they're measuring different kinds of reasoning.

Socratic Dialogue (6 dimensions)
1. Evidence Integration
2. Alternative Perspectives
3. Logical Consistency
4. Assumption Identification
5. Depth of Reasoning
6. Argument Parsimony

Inquiry-Based Exploration (5 dimensions)
1. Observation Coverage
2. Hypothesis Breadth
3. Evaluative Rigor
4. Defense Quality
5. Assumption Identification

Socratic Dialogue scores six dimensions. The sixth — Argument Parsimony — measures whether a student uses the simplest reasoning necessary to support their claim, rather than piling on unnecessary complexity. In philosophy and ethics discussions especially, this matters: the student who makes the tightest argument is often the one who understands the problem most clearly.

Inquiry-Based Exploration uses five dimensions focused on scientific and investigative reasoning — how well students gather evidence, generate hypotheses, evaluate those hypotheses against data, and defend their conclusions.

Both session types score Assumption Identification. It's the one dimension that appears in both rubrics because the ability to name what's being taken for granted is foundational to reasoning in every domain.


What to Look For: A Response Ladder

To understand what scores actually look like in practice, it helps to see a single dimension across five levels of student reasoning. Here's Evidence Integration, in a discussion about whether the city should ban plastic bags:

Evidence Integration: Five Response Levels
Level 1 (Needs Support, score 1.4): "Plastic bags should be banned because they're bad."
Level 2 (Developing, score 4.2): "Plastic bags are bad for the environment, so banning them helps."
Level 3 (Developing, score 6.4): "Studies show that plastic waste is a major source of ocean pollution, which banning bags could reduce."
Level 4 (Strong, score 8.1): "Research indicates plastic makes up roughly 80% of ocean debris, but paper production has higher carbon emissions, so the environmental benefit isn't clear-cut."
Level 5 (Strong, score 9.7): "Ocean plastic data is robust (80%+ of debris), but lifecycle analyses of paper vs. plastic bags show paper has 3x the global warming potential per bag, yet plastic's end-of-life costs are severe in marine environments. The tradeoff depends on which environmental harm is prioritized."

The jump from level 1 to level 2 is about adding a qualifier ("bad for the environment" instead of just "bad"). The jump from level 2 to level 3 is about pointing to a source and a specific harm ("studies show," "ocean pollution") instead of a general claim. The jump from level 3 to level 4 is about acknowledging the counter-evidence. The jump from level 4 to level 5 is about naming the specific values in tension (which environmental harm you're trying to prevent) and making the reasoning framework explicit.

What you don't see in this ladder: appeals to authority ("experts say...") without specifics, or evidence cited without connection to the claim. Those score lower even if the underlying point is right.


What to Do After You See the Scores

The score tells you where the gap is. What you say to close it is the teaching. Here are diagnostic questions mapped to the five dimensions — use them in one-on-one conferences, small groups, or whole-class discussions:

Dimension 1, Evidence Integration: "What's the strongest piece of evidence you have for this position? What would change your view?"
Dimension 2, Alternative Perspectives: "Who would disagree with this? What would their best argument be?"
Dimension 3, Logical Consistency: "Walk me through your reasoning step by step. Is there a place where the logic skips?"
Dimension 4, Assumption Identification: "What are you assuming to be true that you haven't proven yet?"
Dimension 5, Depth of Reasoning: "If you're right, what else follows? If you're wrong, what else falls apart?"
Bonus, Whole-class pattern: "Based on what we're seeing across the class, what does good reasoning actually look like here? What does it sound like in action?"

Use the dimension-specific questions in individual conferences. Use the whole-class question when you notice a pattern across several students — it turns a score into a lesson.

The scores are not for ranking students. They're for identifying where each student needs to go next. A score of 4.2 on Alternatives is not a failure — it's a precise target for your next conversation with that student.


A Suggested Classroom Routine

Here's one way to integrate ThinkingEngine into your week without adding hours of prep. This is a starting point — adapt it to your content and pacing:

Weekly ThinkingEngine Workflow
1. Monday: Assign a topic. Students complete one session at home or in class.
2. Review Scores: Scan the score distribution. Note the lowest dimension; that's your focus for the week.
3. Wednesday: Small group session. Students discuss what they argued and why.
4. Friday: Lead a whole-class discussion using the diagnostic question for the week's focus dimension.
5. Next Week: Run the same topic again. Compare dimension scores; the improvement is the point.
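
Steps 2 and 5 of the routine above can be sketched in a few lines of Python if you keep scores in a spreadsheet export. Everything here is illustrative: the students, their scores, and the function names `class_focus` and `dimension_change` are hypothetical, and this guide does not describe any export format or API.

```python
from statistics import mean

# Hypothetical week-1 scores for a three-student class (made-up data).
week1 = {
    "Student A": {"Evidence": 6.5, "Alternatives": 4.0, "Logic": 7.0,
                  "Assumptions": 5.5, "Depth": 6.0},
    "Student B": {"Evidence": 8.0, "Alternatives": 5.0, "Logic": 6.5,
                  "Assumptions": 6.0, "Depth": 7.5},
    "Student C": {"Evidence": 5.5, "Alternatives": 4.5, "Logic": 6.0,
                  "Assumptions": 5.0, "Depth": 5.5},
}


def class_focus(scores_by_student: dict) -> tuple:
    """Step 2: return the dimension with the lowest class average."""
    dimensions = next(iter(scores_by_student.values())).keys()
    averages = {
        d: mean(scores[d] for scores in scores_by_student.values())
        for d in dimensions
    }
    focus = min(averages, key=averages.get)
    return focus, round(averages[focus], 1)


def dimension_change(before: dict, after: dict) -> dict:
    """Step 5: per-dimension change between two sessions on the same topic."""
    return {d: round(after[d] - before[d], 1) for d in before}


print(class_focus(week1))  # ('Alternatives', 4.5)

# Hypothetical week-2 scores for Student A on the same topic.
student_a_week2 = {"Evidence": 6.5, "Alternatives": 5.5, "Logic": 7.0,
                   "Assumptions": 6.0, "Depth": 6.0}
print(dimension_change(week1["Student A"], student_a_week2))
```

In this made-up class, Alternatives is the week's focus, and Student A's second session shows the largest gain on that same dimension.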

The second session on the same topic is where the real learning happens. Students who got a 5.2 on Alternatives the first week often get a 7.4 the second week — not because the topic was easier, but because they now know what to look for. The AI doesn't change its questions; the student changes their reasoning.

You're not building sessions. You're building a reasoning habit.