How to Interpret Tree Test Results: Success Rate, Directness & Time
You ran your tree test, collected responses from 20+ participants, and now you have a spreadsheet full of numbers. The difference between a tree test that improves your product and one that sits in a slide deck comes down to how well you interpret those numbers and translate them into decisions.
This guide walks through the three core metrics — success rate, directness, and time to completion — explains what good and bad scores actually look like, and shows you how to turn findings into prioritized IA improvements.
The Three Core Metrics
Tree testing produces three primary metrics for each task. Understanding what each one measures and where it falls short on its own is the foundation of good analysis.
Success Rate
Success rate is the percentage of participants who navigated to the correct location for a given task. It is the most straightforward metric and the one most stakeholders will ask about first.
How to calculate it: Divide the number of participants who selected the correct destination by the total number of participants who attempted the task.
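As a minimal sketch, here is that arithmetic in Python (the `results` list is a hypothetical per-participant record of whether each person chose the correct destination):

```python
def success_rate(results: list[bool]) -> float:
    """Percentage of participants who selected the correct destination."""
    if not results:
        return 0.0
    return 100 * sum(results) / len(results)

# Example: 17 of 22 participants found the right location.
print(round(success_rate([True] * 17 + [False] * 5), 1))  # 77.3
```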
Benchmark ranges:
- Above 80% — Strong findability. The label and location match user expectations clearly.
- 60-80% — Acceptable but worth investigating. Some users struggle, and the cause is usually a labeling or grouping issue that a small adjustment can fix.
- 40-60% — Structural problem. Users consistently look in the wrong place, which means the item is categorized in a way that conflicts with their mental model.
- Below 40% — Critical failure. The navigation path is fundamentally broken for this task and needs immediate restructuring.
Success rate alone does not tell the full story. A task can have a 75% success rate and still signal problems if most successful participants backtracked multiple times before finding the answer.
Directness
Directness measures the proportion of successful participants who reached the correct answer on their first attempt — without clicking into a wrong branch and backing out. This metric separates confident navigation from trial-and-error exploration.
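A minimal sketch of the calculation, following the definition above (directness computed over successful participants only; the participant records here are hypothetical):

```python
def directness_rate(participants: list[dict]) -> float:
    """Percentage of successful participants who reached the correct
    answer without entering a wrong branch and backing out."""
    successes = [p for p in participants if p["success"]]
    if not successes:
        return 0.0
    return 100 * sum(1 for p in successes if p["direct"]) / len(successes)

# Example: 15 successes, 9 of them direct -> 60% directness.
data = [{"success": True, "direct": i < 9} for i in range(15)]
print(directness_rate(data))  # 60.0
```

Note that some tools report directness over all participants rather than successes only, so check your tool's definition before comparing numbers across studies.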
What directness reveals:
- High success + high directness — Users know exactly where to go. Your labels and grouping are working.
- High success + low directness — Users eventually find the item, but they wander first. Labels may be ambiguous, or competing categories confuse them before they self-correct.
- Low success + high directness — Users go confidently to the wrong place. This typically means a label is misleading, pulling users toward an incorrect but plausible destination.
- Low success + low directness — Users are lost. The item has no natural home in your current structure from the user's perspective.
Directness is especially valuable because it predicts real-world behavior. A user on a live site who has to backtrack from one section to another will often give up and use search instead — or leave entirely.
Time to Completion
Time measures how long participants take to complete each task. While success rate and directness tell you whether users found the right place, time tells you how much cognitive effort the journey required.
How to use time data effectively:
- Compare tasks against each other rather than using absolute benchmarks. A task about finding "Contact Support" should take less time than one about finding a specific product feature.
- Look for outliers. If most participants complete a task in 15 seconds but a cluster takes 60+, those slower participants likely represent a distinct mental model worth investigating.
- Discard incomplete attempts from time averages, as abandoned tasks skew the data.
Time is the weakest of the three metrics when used alone, but it adds important context. A task with 78% success and fast completion times is in better shape than one with 78% success and slow, labored navigation.
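A sketch of both points in Python, assuming each attempt record carries a completion flag and a duration in seconds (both field names are hypothetical); the 3x-median outlier cutoff is an illustrative choice, not a standard:

```python
from statistics import mean, median

def average_time(attempts: list[dict]) -> float:
    """Mean completion time in seconds, abandoned attempts excluded."""
    times = [a["seconds"] for a in attempts if a["completed"]]
    return mean(times) if times else 0.0

def slow_outliers(attempts: list[dict], factor: float = 3.0) -> list[dict]:
    """Completed attempts that took far longer than the typical one."""
    times = [a["seconds"] for a in attempts if a["completed"]]
    if not times:
        return []
    cutoff = factor * median(times)
    return [a for a in attempts if a["completed"] and a["seconds"] > cutoff]
```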
Step 1: Build Your Analysis Spreadsheet
Before interpreting individual tasks, organize all three metrics into a single view. For each task, record:
- Task description — The scenario you gave participants
- Success rate — Percentage who found the correct answer
- Directness rate — Percentage of successful users who went straight there
- Average time — Mean completion time (excluding abandoned attempts)
- Top incorrect destinations — The 2-3 most common wrong answers
This consolidated view lets you spot patterns across tasks rather than analyzing each one in isolation.
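One lightweight way to hold this view in code is a small dataclass per task; the field names mirror the columns above, and the sample rows are invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class TaskRow:
    task: str                    # scenario given to participants
    success_rate: float          # % who found the correct answer
    directness_rate: float       # % of successful users who went straight there
    avg_time: float              # mean seconds, abandoned attempts excluded
    top_wrong: list[str] = field(default_factory=list)  # most common wrong answers

rows = [
    TaskRow("Find the refund policy", 62.0, 48.0, 41.5,
            ["Support > FAQ", "Account > Orders"]),
    TaskRow("Contact support", 91.0, 88.0, 12.3),
]
```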
Step 2: Flag Problem Tasks
Sort your spreadsheet by success rate, lowest first. Any task below 70% success deserves detailed investigation. But do not stop there — also flag tasks where:
- Directness is below 50% even if success is above 70%
- Time is more than double the average across all tasks
- A single incorrect destination attracted 20%+ of participants
These secondary flags catch problems that success rate alone misses.
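Continuing the `TaskRow` sketch from Step 1, the first three flags translate directly into code (the fourth needs per-destination counts, which the sketch's `top_wrong` list does not carry):

```python
def flag_tasks(rows: list[TaskRow]) -> list[tuple[str, list[str]]]:
    """Return (task, reasons) pairs for tasks that trip any flag."""
    study_avg_time = sum(r.avg_time for r in rows) / len(rows)
    flagged = []
    for r in rows:
        reasons = []
        if r.success_rate < 70:
            reasons.append("success below 70%")
        if r.directness_rate < 50:
            reasons.append("directness below 50%")
        if r.avg_time > 2 * study_avg_time:
            reasons.append("time more than double the study average")
        # The 20%+ single-wrong-destination flag needs per-destination
        # counts, which this sketch does not track.
        if reasons:
            flagged.append((r.task, reasons))
    return flagged
```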
Step 3: Map Failure Patterns
For each flagged task, examine where participants actually went. Most tree testing tools show you a path analysis or "pietree" visualization that maps every click.
Look for these patterns:
- Single competing destination — Most failures go to the same wrong place. This means two categories have overlapping scope, and users cannot distinguish between them. Fix by renaming one or both categories, or by moving the item.
- Scattered failures — Participants go to many different wrong places. This means the item has no intuitive home in your structure. Consider whether the item needs a new top-level category or whether your labels are too abstract.
- First-click errors that self-correct — Participants start in the wrong branch but backtrack and find the answer. This points to a labeling problem at the top level that resolves once users see the contents of each section.
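A sketch that distinguishes the first two patterns from a list of failed participants' final destinations; the share thresholds are illustrative judgment calls, not standards, and the self-correcting first-click pattern needs full path data, which this sketch omits:

```python
from collections import Counter

def failure_pattern(wrong_destinations: list[str]) -> str:
    """Classify where failed participants ended up."""
    if not wrong_destinations:
        return "no failures"
    counts = Counter(wrong_destinations)
    top_dest, top_count = counts.most_common(1)[0]
    top_share = top_count / len(wrong_destinations)
    if top_share >= 0.6:
        return f"single competing destination: {top_dest}"
    if len(counts) >= 4:
        return "scattered failures"
    return "mixed: inspect the path visualization by hand"

print(failure_pattern(["Support > FAQ"] * 7 + ["Account > Orders"] * 2))
# single competing destination: Support > FAQ
```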
Step 4: Prioritize Fixes by Business Impact
Not every failed task deserves equal attention. Prioritize based on two factors:
- Severity — How far below the 70% threshold is the success rate? A task at 65% needs a different response than one at 35%.
- Business importance — How frequently will real users attempt this task? A task that represents a daily user action matters more than an edge case performed once a year.
Create a simple 2x2 matrix: high severity + high importance items get fixed first, low severity + low importance items go to the backlog.
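As a sketch of that matrix (the 55% severity cutoff is an invented example; business importance is a judgment call you supply from analytics or stakeholder input):

```python
def priority_quadrant(success_rate: float, high_importance: bool) -> str:
    """Place a flagged task in the severity x importance 2x2."""
    high_severity = success_rate < 55  # illustrative cutoff, well below 70%
    if high_severity and high_importance:
        return "fix first"
    if not high_severity and not high_importance:
        return "backlog"
    return "schedule after the fix-first items"

print(priority_quadrant(38.0, high_importance=True))   # fix first
print(priority_quadrant(66.0, high_importance=False))  # backlog
```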
Step 5: Generate Recommendations
For each prioritized issue, write a specific recommendation tied to the data:
- Rename categories when a single competing destination attracts users with a misleading label
- Move items when users consistently expect an item in a different branch
- Split categories when a single category contains too many unrelated items and users cannot predict what lives inside it
- Merge categories when scattered failures suggest users do not see a meaningful distinction between two separate sections
Each recommendation should reference the specific task, its success rate, and the failure pattern that supports the change.
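One way to keep recommendations tied to the data is to template them on the failure patterns from Step 3; the wording below is invented for illustration:

```python
TEMPLATES = {
    "single competing destination":
        "Rename or move: '{task}' ({rate:.0f}% success); most failures went to {dest}.",
    "scattered failures":
        "Restructure: '{task}' ({rate:.0f}% success) has no intuitive home in the tree.",
}

def recommend(pattern: str, task: str, rate: float, dest: str = "") -> str:
    template = TEMPLATES.get(pattern)
    if template is None:
        return f"Review '{task}' manually; the '{pattern}' pattern needs human judgment."
    return template.format(task=task, rate=rate, dest=dest)

print(recommend("single competing destination",
                "Find the refund policy", 62.0, "Support > FAQ"))
```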
Validating Your Changes
After implementing IA changes based on your analysis, run a follow-up tree test with 8-10 participants to confirm improvements. Focus the follow-up test on the specific tasks that failed in the original study. If you used card sorting to generate your initial structure, this validation loop closes the research cycle.
Compare before-and-after success rates for each revised task. Improvements of 15+ percentage points confirm your changes addressed the underlying problem. Smaller improvements may indicate that additional structural changes are needed.
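A small comparison sketch; keep in mind that with 8-10 follow-up participants each person moves the rate by 10 points or more, so treat the deltas as directional (the task names and rates below are hypothetical):

```python
def compare_rounds(before: dict[str, float], after: dict[str, float]) -> None:
    """Print the percentage-point change for each retested task."""
    for task, old in before.items():
        new = after.get(task)
        if new is None:
            continue  # task was not retested
        delta = new - old
        verdict = "likely addressed" if delta >= 15 else "needs more work"
        print(f"{task}: {old:.0f}% -> {new:.0f}% ({delta:+.0f} pts, {verdict})")

compare_rounds({"Find the refund policy": 62.0},
               {"Find the refund policy": 84.0})
# Find the refund policy: 62% -> 84% (+22 pts, likely addressed)
```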
Common Interpretation Mistakes
Treating 100% success as the goal. Some tasks will never reach 100% because a small percentage of participants will always click randomly or misunderstand the scenario. Target 80%+ for critical tasks and 70%+ for everything else.
Ignoring successful tasks. Tasks with high success rates and high directness tell you what is working in your IA. Preserve these patterns when restructuring problem areas — avoid breaking what already works.
Over-indexing on time. Slow completion does not always mean bad IA. Complex or deeply nested items naturally take longer to find. Only flag time as a problem when it is disproportionate to task complexity.
Analyzing tasks in isolation. The most valuable insights come from patterns across multiple tasks. If three tasks that involve your "Resources" section all have low directness, the problem is the section itself — not the individual items.
Further Reading
- How to Plan and Run a Tree Test Study
- Tree Testing (UX Glossary)
- Card Sorting vs Tree Testing
- How to Analyze Card Sort Results
Frequently Asked Questions
What is a good success rate for a tree test? A task success rate above 70% is generally considered acceptable. Rates above 80% indicate strong findability, while rates below 50% signal critical navigation problems. Use these thresholds as guidelines rather than rigid cutoffs — context matters, and a 68% success rate on a complex task may be acceptable while 72% on a simple one may warrant investigation.
What does directness mean in tree testing? Directness measures the percentage of participants who found the correct answer on their first attempt without backtracking. A high success rate paired with low directness means users eventually find items but struggle along the way, which predicts frustration and search reliance in real-world navigation.
How many tasks should I include in a tree test? Most studies include 6-10 tasks. Fewer than 5 tasks may not cover enough of your IA, while more than 12 risks participant fatigue that degrades data quality for later tasks. Prioritize tasks that map to your most important user goals and known problem areas.