🎉 ACCEPTED at IUI 2026

Improving Human Verification of LLM Reasoning through Interactive Explanation Interfaces

¹University of Virginia    ²Auburn University    ³Independent Researcher

Transforming static chain-of-thought outputs into interactive interfaces that enhance verification accuracy and reduce cognitive load

The Challenge

As LLMs generate increasingly longer chain-of-thought reasoning (often thousands of tokens), users face a critical challenge: How can they efficiently comprehend LLM reasoning and detect errors or hallucinations when presented with walls of text?

Traditional vs Interactive Interface

Figure 1: Comparison of traditional CoT and interactive interface showing navigation buttons and colored highlights

Given a GSM8K question, LLMs typically produce step-by-step reasoning followed by a final answer. However, this output is static and often very long, imposing a higher cognitive load and leading to slower, more error-prone answer verification. In contrast, we prompt LLMs to generate interactive HTML/JavaScript applications with (a) navigation buttons (inspired by common IDEs) and (b) colored highlights. These interfaces let users step through the reasoning, reducing cognitive load and making verification faster and more accurate. A minimal prompting sketch follows.
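The sketch below shows roughly what this generation step could look like. It assumes a generic call_llm helper and an illustrative prompt; neither reproduces the paper's actual implementation or prompt wording.

# Hypothetical sketch: prompt an LLM to emit a self-contained interactive
# explanation app. call_llm is a placeholder, not the paper's code.

INTERFACE_PROMPT = """\
Solve the following math problem step by step.
Then return ONE self-contained HTML/JavaScript page that:
  (a) shows one reasoning step at a time with Next/Previous navigation buttons,
  (b) highlights each variable and intermediate value in a consistent color.
Problem: {question}
"""

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion client; returns the model's text."""
    raise NotImplementedError  # plug in your LLM API of choice

def generate_interactive_explanation(question: str) -> str:
    html_app = call_llm(INTERFACE_PROMPT.format(question=question))
    with open("explanation.html", "w") as f:  # open this file in a browser
        f.write(html_app)
    return html_app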

Try the Interactive Explanation Formats

Experience how our three interactive formats help you verify LLM reasoning

Problem Statement

A bathroom has 10 6-inch tiles along its width and 20 6-inch tiles along its length. What is the square footage of the bathroom?

Problem Summary
  • Width: 10 tiles × 6 inches
  • Length: 20 tiles × 6 inches
Goal: Calculate the square footage of the bathroom

Three Interactive Frameworks

We introduce novel explanation formats that improve interpretability and usability of LLM reasoning
💡 Hover over each card to see examples

🔗

iCoT

Interactive Chain-of-Thought: Navigate through reasoning steps with IDE-inspired controls, colored highlights, and structured presentation that reduces cognitive load.

iCoT Example

Discrete blocks with color-coded variables

Step 1: Calculate width
Width = 10 tiles × 6 in = 60 inches
Step 2: Convert width to feet
Width = 60 in ÷ 12 = 5 feet
Step 3: Calculate length
Length = 20 tiles × 6 in = 120 in ÷ 12 = 10 feet
Step 4: Calculate area
Area = 5 ft × 10 ft = 50 sq ft

✓ Visual segmentation
✓ Color-coded tracking
✓ Sequential playback
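As a rough illustration of the "discrete blocks" idea, the sketch below splits a plain chain-of-thought string into navigable step blocks. The "Step N:" delimiter and the helper name are assumptions for this example, not the paper's code.

import re

def segment_cot(cot_text: str) -> list[str]:
    """Split a raw chain-of-thought string just before every 'Step N:' marker."""
    blocks = re.split(r"(?=Step \d+:)", cot_text)
    return [b.strip() for b in blocks if b.strip()]

cot = (
    "Step 1: Calculate width. Width = 10 tiles x 6 in = 60 inches. "
    "Step 2: Convert width to feet. Width = 60 in / 12 = 5 feet. "
    "Step 3: Calculate length. Length = 20 x 6 = 120 in = 10 feet. "
    "Step 4: Calculate area. Area = 5 ft x 10 ft = 50 sq ft."
)

for i, block in enumerate(segment_cot(cot), start=1):
    print(f"[block {i}] {block}")  # each block maps to one navigable card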

⚙️

iPoT

Interactive Program-of-Thought: Program-based reasoning with interactive execution visualization, making computational logic transparent and verifiable.

iPoT Example

Code-based reasoning with variable tracking

# Calculate bathroom area
width_tiles = 10
length_tiles = 20
tile_size = 6  # inches

width_inches = width_tiles * tile_size
length_inches = length_tiles * tile_size

width_feet = width_inches / 12
length_feet = length_inches / 12

bathroom_area = width_feet * length_feet  # 50.0 square feet

✓ Systematic computation
✓ Variable panel updates
✓ Step-by-step execution
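To illustrate the "variable panel updates" idea, the sketch below replays a small program one statement at a time and prints the variable state after each step. It is an illustration of the concept under assumed names, not the interface code itself.

# Hypothetical sketch: execute a Program-of-Thought line by line and show
# the variable state after each statement, mimicking a variable panel.
pot_lines = [
    "width_tiles = 10",
    "length_tiles = 20",
    "tile_size = 6",
    "width_inches = width_tiles * tile_size",
    "length_inches = length_tiles * tile_size",
    "width_feet = width_inches / 12",
    "length_feet = length_inches / 12",
    "bathroom_area = width_feet * length_feet",
]

state: dict = {}
for step, line in enumerate(pot_lines, start=1):
    exec(line, {}, state)  # run one statement against the shared state
    print(f"Step {step}: {line}")
    print(f"  variables: {state}")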

📊

iGraph

Interactive Graph: Graph-based reasoning visualization that reveals structural relationships and dependencies in complex logical flows.

iGraph Example

Interactive node-link graph

[Node-link graph: Width 60 in ÷ 12 → Width 5 ft; Length 120 in ÷ 12 → Length 10 ft; Width × Length → Area 50 sq ft ✓]

✓ Draggable nodes
✓ 85.6% accuracy
✓ Clear dependencies
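The dependency structure behind the graph above can be written down as a small adjacency map, as in the sketch below. The node names are illustrative; the paper's iGraph renders an interactive HTML/JS node-link diagram rather than printing text.

# Hypothetical sketch: which quantities feed into which results.
edges = {
    "width_60_in":   ["width_tiles_10", "tile_size_6_in"],   # 10 x 6
    "length_120_in": ["length_tiles_20", "tile_size_6_in"],  # 20 x 6
    "width_5_ft":    ["width_60_in"],                        # / 12
    "length_10_ft":  ["length_120_in"],                      # / 12
    "area_50_sqft":  ["width_5_ft", "length_10_ft"],         # 5 x 10
}

def trace(node: str, depth: int = 0) -> None:
    """Print the chain of dependencies behind a result node."""
    print("  " * depth + node)
    for parent in edges.get(node, []):
        trace(parent, depth + 1)

trace("area_50_sqft")  # walk back from the answer to its inputs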

From Static to Interactive

Transforming how users engage with AI reasoning

Traditional CoT
  • Long, static text output
  • Linear presentation
  • High cognitive load
  • Difficult error detection
  • Time-consuming verification
"Wall of text" problem

Interactive Interface
  ✓ Navigation controls
  ✓ Step-by-step exploration
  ✓ Colored highlights
  ✓ Structured presentation
  ✓ Interactive HTML/JS apps
Enhanced comprehension

Video Overview

Watch a comprehensive walkthrough of our interactive explanation interfaces

Key Findings

User study with 125 participants demonstrates significant improvements

125 Participants
85.6% iGraph Verification Accuracy
57.9s iGraph Verification Time

Main Findings:

  • iGraph achieves the highest verification accuracy at 85.6%, followed by iPoT (82.5%) and iCoT (80.6%), all significantly outperforming traditional CoT (73.5%)
  • Interactive interfaces reduce validation time: iGraph users complete tasks in 57.9 seconds, compared to roughly 60 seconds for iPoT and iCoT and 64.7 seconds for CoT
  • Error localization improves dramatically: iGraph achieves 85.2% accuracy in identifying the exact error step, versus 66.1% for traditional CoT
  • Users strongly prefer interactive formats: post-study questionnaires rate iGraph highest for understanding (91%), engagement (73%), and satisfaction (91%)

Verification Accuracy
↑ +12.1% (85.6% vs 73.5%)
iGraph vs Traditional CoT

Time Required
↓ -6.8s (57.9s vs 64.7s)
10.5% faster verification

Error Localization
↑ +19.1% (85.2% vs 66.1%)
Identifying exact error steps

User Preference
⭐ Top Rated: 91% satisfaction
iGraph rated highest overall

Try It Yourself

Try on HuggingFace 🤗
Experience our interactive explanation interfaces firsthand. Test how iCoT, iPoT, and iGraph transform the verification process for LLM reasoning on math problems from GSM8K and other benchmarks.

Click to visit our HuggingFace Space →