Investigation Method · Step Zero

Box the Problem

Most investigations fail in the first five minutes

Not because the engineer is wrong, but because they're solving the right problem in the wrong shape. I've watched teams spend two days chasing antivirus (AV) when the real question was: which endpoint, which job, which window, what changed. That's not a skill gap. That's a framing gap. So before I touch anything, I box the problem.

// What most people do

"AV is breaking backups" → start pulling AV exclusions

// What I do first

Which job? Which agent? Which repo? What error? What changed? What time window?

Six filters · stop wasted motion before it starts

01 · Scope

One endpoint or many?

This single answer tells me whether I'm looking at a local condition or a systemic failure. The remediation is completely different.

02 · Impact

Broken, degraded, or at risk?

I classify by availability, integrity, confidentiality — not by how loud the ticket is. Loud is not the same as critical.

03 · Time Behavior

When does it happen?

Constant, intermittent, after reboot, after patch, after policy sync. When it happens is often as valuable as what happens.

04 · Reproducibility

Can I trigger it safely?

If I can reproduce it, I can test it. If I can't, I need to catch it live — and that changes my whole collection strategy.

05 · Blast Radius

What's adjacent?

What can become collateral? I need to know what I'm not allowed to break before I start moving anything.

06 · Constraints

What can't I touch?

What must stay online? What's compliance-sensitive? Knowing the hard edges tells me exactly where I'm allowed to work.
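
Two of these filters, scope and time behavior, can often be answered mechanically from failure records before anyone forms an opinion. A minimal sketch, assuming failure events are available as simple records; `FailureEvent`, `classify_scope`, and `failures_by_hour` are illustrative names, not part of any backup or monitoring product.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FailureEvent:
    endpoint: str   # hypothetical host or agent name
    job: str        # hypothetical job name
    at: datetime    # when the failure was recorded


def classify_scope(events: list[FailureEvent]) -> str:
    """Answer 'one endpoint or many?' from the distinct endpoints seen."""
    endpoints = {e.endpoint for e in events}
    if not endpoints:
        return "none"
    # "many" only suggests a systemic cause; it does not prove one.
    return "local" if len(endpoints) == 1 else "many"


def failures_by_hour(events: list[FailureEvent]) -> Counter:
    """Count failures per hour of day to expose time-of-day clustering."""
    return Counter(e.at.hour for e in events)
```

If every failure lands on one endpoint in the same hour, that points toward a local, scheduled interaction rather than a global policy problem. It is still only a starting hypothesis.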

Two anchors · create the time box

FKG

Anchor One

First Known Good

Last confirmed point where everything worked. If the user can't provide it, I derive it from monitoring, agent check-ins, job history, and log timestamps.

window

FKB

Anchor Two

First Known Bad

Earliest confirmed failure point. Together, these two anchors create a time box tight enough to search for meaningful change — not a full week of noise.
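
When the user can't supply the anchors, job history is usually the quickest source to derive them from. A minimal sketch, assuming job history can be exported as (finish time, success) records; `JobRun` and `time_box` are hypothetical names, and the sketch treats the earliest failure in the supplied history as First Known Bad.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class JobRun(NamedTuple):
    finished_at: datetime
    succeeded: bool


def time_box(
    runs: list[JobRun],
) -> Optional[tuple[Optional[datetime], datetime]]:
    """Return (first_known_good, first_known_bad), or None if nothing failed.

    first_known_bad is the earliest failed run in the supplied history.
    first_known_good is the latest successful run before that failure,
    or None when the history holds no earlier success.
    """
    ordered = sorted(runs, key=lambda r: r.finished_at)
    first_bad = next((r for r in ordered if not r.succeeded), None)
    if first_bad is None:
        return None
    earlier_goods = [
        r.finished_at
        for r in ordered
        if r.succeeded and r.finished_at < first_bad.finished_at
    ]
    known_good = earlier_goods[-1] if earlier_goods else None
    return known_good, first_bad.finished_at
```

The returned pair is the window to search for changes: patches, policy syncs, and automation runs that fall between the two timestamps.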

Ruling things out · this is where investigations win time

// if [condition] → then [constraint on the investigation]

Failure affects only one device → I do not start rewriting global policies.

Service is healthy but slow → I treat it as degradation, not an outage. Completely different fix.

Backups fail only when AV scan peaks → I treat it as a timing interaction until evidence says otherwise.

Automation ran inside the failure window → It's a suspect action. I go find exactly what it changed.

Someone says "it's AV" or "it's the network" → That's a category, not a root cause. It stays a hypothesis.
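
The observable rules above can be written down as plain condition checks, which makes them easy to review with a team. A minimal sketch, assuming the observations have already been gathered by hand; `Observations` and `investigation_constraints` are illustrative names, and the rule about "it's AV" claims is left out because it concerns hypotheses rather than observations.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    affected_endpoints: int
    service_healthy: bool
    service_slow: bool
    fails_only_at_av_scan_peak: bool
    automation_ran_in_window: bool


def investigation_constraints(obs: Observations) -> list[str]:
    """Map observed conditions to constraints on the investigation."""
    constraints = []
    if obs.affected_endpoints == 1:
        constraints.append("do not rewrite global policies")
    if obs.service_healthy and obs.service_slow:
        constraints.append("treat as degradation, not an outage")
    if obs.fails_only_at_av_scan_peak:
        constraints.append("treat as a timing interaction until evidence says otherwise")
    if obs.automation_ran_in_window:
        constraints.append("treat the automation as a suspect; find what it changed")
    return constraints
```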

The output · a tight container I can operate inside

boxed_problem.template

Observed Symptom: One sentence describing an observable condition, not an assumption

Scope & Impact: Endpoint count · service layer · A / I / C classification

First Known Good: Timestamp · source (log, monitoring, job history, user report)

First Known Bad: Timestamp · source · how confirmed

Frequency & Repro: Constant / intermittent · trigger · safe repro method if known

Out of Scope: What the failure behavior explicitly rules out

Allowed Hypotheses: Only theories that fit the observed behavior. Nothing else makes the list yet.

Next Evidence: Specific logs or tests that will shrink the box to one mechanism
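
The template can also be kept as a small record so an incomplete box is obvious before work starts. A minimal sketch, assuming free-text fields are enough; `BoxedProblem` and its `missing` method are hypothetical, with field names mirroring the template rows above.

```python
from dataclasses import dataclass, field, fields


@dataclass
class BoxedProblem:
    observed_symptom: str = ""
    scope_and_impact: str = ""
    first_known_good: str = ""
    first_known_bad: str = ""
    frequency_and_repro: str = ""
    out_of_scope: list[str] = field(default_factory=list)
    allowed_hypotheses: list[str] = field(default_factory=list)
    next_evidence: list[str] = field(default_factory=list)

    def missing(self) -> list[str]:
        """Names of template rows that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]
```

A box whose `missing()` list still contains either anchor is a signal to go back to monitoring, job history, or logs before testing any hypothesis.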