🎉 ACCEPTED at IUI 2026

Improving Human Verification of LLM Reasoning through Interactive Explanation Interfaces

¹University of Virginia    ²Auburn University    ³Independent Researcher

Transforming static chain-of-thought outputs into interactive interfaces that enhance verification accuracy and reduce cognitive load

The Challenge

As LLMs generate increasingly longer chain-of-thought reasoning (often thousands of tokens), users face a critical challenge: How can they efficiently comprehend LLM reasoning and detect errors or hallucinations when presented with walls of text?

Traditional vs Interactive Interface

Figure 1: Comparison of traditional CoT and interactive interface showing navigation buttons and colored highlights

Given a GSM8K question, LLMs typically produce step-by-step reasoning followed by a final answer. However, this output is static and often very long, imposing a higher cognitive load and leading to slower, more error-prone answer verification. In contrast, we prompt LLMs to generate interactive HTML/JavaScript applications with (a) navigation buttons (inspired by common IDEs) and (b) colored highlights. These interfaces let users step through the reasoning, reducing cognitive load and making verification faster and more accurate. A minimal prompting sketch follows.
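The sketch below shows roughly what this generation step could look like. It assumes a generic call_llm helper and an illustrative prompt; neither reproduces the paper's actual implementation or prompt wording.

# Hypothetical sketch: prompt an LLM to emit a self-contained interactive
# explanation app. call_llm is a placeholder, not the paper's code.

INTERFACE_PROMPT = """\
Solve the following math problem step by step.
Then return ONE self-contained HTML/JavaScript page that:
  (a) shows one reasoning step at a time with Next/Previous navigation buttons,
  (b) highlights each variable and intermediate value in a consistent color.
Problem: {question}
"""

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion client; returns the model's text."""
    raise NotImplementedError  # plug in your LLM API of choice

def generate_interactive_explanation(question: str) -> str:
    html_app = call_llm(INTERFACE_PROMPT.format(question=question))
    with open("explanation.html", "w") as f:  # open this file in a browser
        f.write(html_app)
    return html_app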

Try the Interactive Explanation Formats

Experience how our three interactive formats help you verify LLM reasoning

Problem Statement

A bathroom has 10 6-inch tiles along its width and 20 6-inch tiles along its length. What is the square footage of the bathroom?

Problem Summary
  • Width: 10 tiles × 6 inches
  • Length: 20 tiles × 6 inches
Goal: Calculate the square footage of the bathroom

Three Interactive Frameworks

We introduce novel explanation formats that improve interpretability and usability of LLM reasoning
💡 Hover over each card to see examples

🔗

iCoT

Interactive Chain-of-Thought: Navigate through reasoning steps with IDE-inspired controls, colored highlights, and structured presentation that reduces cognitive load.

iCoT Example

Discrete blocks with color-coded variables

Step 1: Calculate width
Width = 10 tiles × 6 in = 60 inches
Step 2: Convert width to feet
Width = 60 in ÷ 12 = 5 feet
Step 3: Calculate length
Length = 20 tiles × 6 in = 120 in ÷ 12 = 10 feet
Step 4: Calculate area
Area = 5 ft × 10 ft = 50 sq ft

✓ Visual segmentation
✓ Color-coded tracking
✓ Sequential playback
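As a rough illustration of the "discrete blocks" idea, the sketch below splits a plain chain-of-thought string into navigable step blocks. The "Step N:" delimiter and the helper name are assumptions for this example, not the paper's code.

import re

def segment_cot(cot_text: str) -> list[str]:
    """Split a raw chain-of-thought string just before every 'Step N:' marker."""
    blocks = re.split(r"(?=Step \d+:)", cot_text)
    return [b.strip() for b in blocks if b.strip()]

cot = (
    "Step 1: Calculate width. Width = 10 tiles x 6 in = 60 inches. "
    "Step 2: Convert width to feet. Width = 60 in / 12 = 5 feet. "
    "Step 3: Calculate length. Length = 20 x 6 = 120 in = 10 feet. "
    "Step 4: Calculate area. Area = 5 ft x 10 ft = 50 sq ft."
)

for i, block in enumerate(segment_cot(cot), start=1):
    print(f"[block {i}] {block}")  # each block maps to one navigable card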

⚙️

iPoT

Interactive Program-of-Thought: Program-based reasoning with interactive execution visualization, making computational logic transparent and verifiable.

iPoT Example

Code-based reasoning with variable tracking

# Calculate bathroom area
width_tiles = 10
length_tiles = 20
tile_size = 6  # inches

width_inches = width_tiles * tile_size
length_inches = length_tiles * tile_size

width_feet = width_inches / 12
length_feet = length_inches / 12

bathroom_area = width_feet * length_feet  # 50.0 square feet

✓ Systematic computation
✓ Variable panel updates
✓ Step-by-step execution
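To illustrate the "variable panel updates" idea, the sketch below replays a small program one statement at a time and prints the variable state after each step. It is an illustration of the concept under assumed names, not the interface code itself.

# Hypothetical sketch: execute a Program-of-Thought line by line and show
# the variable state after each statement, mimicking a variable panel.
pot_lines = [
    "width_tiles = 10",
    "length_tiles = 20",
    "tile_size = 6",
    "width_inches = width_tiles * tile_size",
    "length_inches = length_tiles * tile_size",
    "width_feet = width_inches / 12",
    "length_feet = length_inches / 12",
    "bathroom_area = width_feet * length_feet",
]

state: dict = {}
for step, line in enumerate(pot_lines, start=1):
    exec(line, {}, state)  # run one statement against the shared state
    print(f"Step {step}: {line}")
    print(f"  variables: {state}")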

📊

iGraph

Interactive Graph: Graph-based reasoning visualization that reveals structural relationships and dependencies in complex logical flows.

iGraph Example

Interactive node-link graph

[Node-link graph: Width 60 in ÷ 12 → Width 5 ft; Length 120 in ÷ 12 → Length 10 ft; Width × Length → Area 50 sq ft ✓]

✓ Draggable nodes
✓ 85.6% accuracy
✓ Clear dependencies
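The dependency structure behind the graph above can be written down as a small adjacency map, as in the sketch below. The node names are illustrative; the paper's iGraph renders an interactive HTML/JS node-link diagram rather than printing text.

# Hypothetical sketch: which quantities feed into which results.
edges = {
    "width_60_in":   ["width_tiles_10", "tile_size_6_in"],   # 10 x 6
    "length_120_in": ["length_tiles_20", "tile_size_6_in"],  # 20 x 6
    "width_5_ft":    ["width_60_in"],                        # / 12
    "length_10_ft":  ["length_120_in"],                      # / 12
    "area_50_sqft":  ["width_5_ft", "length_10_ft"],         # 5 x 10
}

def trace(node: str, depth: int = 0) -> None:
    """Print the chain of dependencies behind a result node."""
    print("  " * depth + node)
    for parent in edges.get(node, []):
        trace(parent, depth + 1)

trace("area_50_sqft")  # walk back from the answer to its inputs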

From Static to Interactive

Transforming how users engage with AI reasoning

Traditional CoT
  • Long, static text output
  • Linear presentation
  • High cognitive load
  • Difficult error detection
  • Time-consuming verification
"Wall of text" problem

Interactive Interface
  ✓ Navigation controls
  ✓ Step-by-step exploration
  ✓ Colored highlights
  ✓ Structured presentation
  ✓ Interactive HTML/JS apps
Enhanced comprehension

Video Overview

Watch a comprehensive walkthrough of our interactive explanation interfaces

Key Findings

User study with 125 participants demonstrates significant improvements

125 Participants
85.6% iGraph Verification Accuracy
57.9s iGraph Verification Time

Main Findings:

  • iGraph achieves the highest verification accuracy at 85.6%, followed by iPoT (82.5%) and iCoT (80.6%), all significantly outperforming traditional CoT (73.5%)
  • Interactive interfaces reduce validation time: iGraph users complete tasks in 57.9 seconds, compared to roughly 60 seconds for iPoT and iCoT and 64.7 seconds for CoT
  • Error localization improves dramatically: iGraph achieves 85.2% accuracy in identifying the exact error step, versus 66.1% for traditional CoT
  • Users strongly prefer interactive formats: post-study questionnaires rate iGraph highest for understanding (91%), engagement (73%), and satisfaction (91%)

Verification Accuracy
↑ +12.1% (85.6% vs 73.5%)
iGraph vs Traditional CoT

Time Required
↓ -6.8s (57.9s vs 64.7s)
10.5% faster verification

Error Localization
↑ +19.1% (85.2% vs 66.1%)
Identifying exact error steps

User Preference
⭐ Top Rated: 91% satisfaction
iGraph rated highest overall

Try It Yourself

Try on HuggingFace 🤗
Experience our interactive explanation interfaces firsthand. Test how iCoT, iPoT, and iGraph transform the verification process for LLM reasoning on math problems from GSM8K and other benchmarks.

Click to visit our HuggingFace Space →