Michael Krawczyk | Information Security Engineer · 18+ YRS · MCP
Investigation Method · Step Zero
Most investigations fail in the first five minutes
Box the Problem
Not because the engineer is wrong, but because they're solving the right problem in the wrong shape. I've watched teams spend two days chasing antivirus (AV) when the real questions were: which endpoint, which job, which window, and what changed? That's not a skill gap. That's a framing gap. So before I touch anything, I box the problem.
// What most people do
"AV is breaking backups" → start pulling AV exclusions
VS
// What I do first
Which job? Which agent? Which repo? What error? What changed? What time window?
01 · Scope
One endpoint or many?
This single answer tells me whether I'm looking at a local condition or a systemic failure. The remediation is completely different.
02 · Impact
Broken, degraded, or at risk?
I classify by availability, integrity, confidentiality — not by how loud the ticket is. Loud is not the same as critical.
03 · Time Behavior
When does it happen?
Constant, intermittent, after reboot, after patch, after policy sync. When it happens is often as valuable as what happens.
04 · Reproducibility
Can I trigger it safely?
If I can reproduce it, I can test it. If I can't, I need to catch it live — and that changes my whole collection strategy.
05 · Blast Radius
What's adjacent?
What could become collateral damage? I need to know what I'm not allowed to break before I start moving anything.
06 · Constraints
What can't I touch?
What must stay online? What's compliance-sensitive? Knowing the hard edges tells me exactly where I'm allowed to work.
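The Scope and Time Behavior questions above can often be answered straight from failure records. What follows is a minimal sketch, not a prescribed tool: the FailureEvent fields, the host and job names, and the sample timestamps are all hypothetical, and a real investigation would load these records from job history or monitoring exports.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FailureEvent:
    endpoint: str   # hypothetical field: host or agent name
    job: str        # hypothetical field: backup job or task name
    at: datetime    # when the failure was observed


def summarize_failures(events: list[FailureEvent]) -> dict:
    """Summarize scope (one endpoint or many) and time behavior (hours of day)."""
    endpoints = Counter(e.endpoint for e in events)
    hours = Counter(e.at.hour for e in events)
    if not endpoints:
        scope = "unknown"
    elif len(endpoints) == 1:
        scope = "single endpoint"
    else:
        scope = "multiple endpoints"
    return {
        "scope": scope,
        "failures_per_endpoint": dict(endpoints),
        # Hours of day with the most failures, most frequent first.
        "busiest_hours": [hour for hour, _ in hours.most_common(3)],
    }


# Illustrative data only: three failures, all on the same host.
events = [
    FailureEvent("host-a", "nightly-backup", datetime(2024, 1, 10, 2, 5)),
    FailureEvent("host-a", "nightly-backup", datetime(2024, 1, 11, 2, 7)),
    FailureEvent("host-a", "nightly-backup", datetime(2024, 1, 12, 3, 1)),
]
summary = summarize_failures(events)
print(summary["scope"], summary["busiest_hours"])
```

A "single endpoint" result points toward a local condition, while a cluster of failures in the same hours suggests looking at whatever else runs on that schedule.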
FKG
Anchor One
First Known Good
Last confirmed point where everything worked. If the user can't provide it, I derive it from monitoring, agent check-ins, job history, and log timestamps.
window
FKB
Anchor Two
First Known Bad
Earliest confirmed failure point. Together, these two anchors create a time box tight enough to search for meaningful change — not a full week of noise.
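As a sketch of how the two anchors narrow the search, the snippet below keeps only the change events that fall between First Known Good and First Known Bad. The ChangeEvent shape, the source labels, and the sample entries are assumptions made for illustration; in practice the change list would come from patch logs, policy sync history, or automation run records.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ChangeEvent:
    at: datetime
    source: str        # hypothetical label: "patch", "policy sync", "automation"
    description: str


def changes_in_window(
    changes: list[ChangeEvent],
    first_known_good: datetime,
    first_known_bad: datetime,
) -> list[ChangeEvent]:
    """Return the changes inside the FKG..FKB time box, oldest first."""
    if first_known_bad < first_known_good:
        raise ValueError("First Known Bad precedes First Known Good; recheck the anchors")
    inside = [c for c in changes if first_known_good <= c.at <= first_known_bad]
    return sorted(inside, key=lambda c: c.at)


# Illustrative data only.
changes = [
    ChangeEvent(datetime(2024, 1, 9, 22, 0), "policy sync", "exclusion list updated"),
    ChangeEvent(datetime(2024, 1, 10, 12, 0), "patch", "agent upgraded"),
    ChangeEvent(datetime(2024, 1, 12, 8, 0), "automation", "cleanup script ran"),
]
fkg = datetime(2024, 1, 10, 2, 0)   # last confirmed successful run
fkb = datetime(2024, 1, 11, 2, 0)   # earliest confirmed failure
suspects = changes_in_window(changes, fkg, fkb)
for change in suspects:
    print(change.at.isoformat(), change.source, change.description)
```

Everything outside the window is set aside, not deleted: if the anchors later turn out to be wrong, the box gets redrawn and the filter is rerun.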
// if [condition] → then [constraint on the investigation]
IF
Failure affects only one device
I do not start rewriting global policies
IF
Service is healthy but slow
I treat it as degradation — not an outage. Completely different fix.
IF
Backups fail only when AV scan peaks
I treat it as a timing interaction until evidence says otherwise
IF
Automation ran inside the failure window
It's a suspect action. I go find exactly what it changed.
IF
Someone says "it's AV" or "it's the network"
That's a category, not a root cause. It stays a hypothesis.
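The if/then pairs above read naturally as data: a condition on what has been observed, and the constraint it places on the investigation. The following is one possible encoding, offered as a sketch only. The Observations fields are hypothetical names chosen for this example, and the rule texts paraphrase the list above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Observations:
    # All field names here are hypothetical, chosen for this illustration.
    affected_endpoints: int
    service_responding: bool
    slow: bool
    failures_track_av_scan_peak: bool
    automation_ran_in_window: bool
    category_claims: tuple[str, ...] = ()   # e.g. ("it's AV",)


Rule = tuple[Callable[[Observations], bool], str]

RULES: list[Rule] = [
    (lambda o: o.affected_endpoints == 1,
     "Do not rewrite global policies."),
    (lambda o: o.service_responding and o.slow,
     "Treat as degradation, not an outage."),
    (lambda o: o.failures_track_av_scan_peak,
     "Treat as a timing interaction until evidence says otherwise."),
    (lambda o: o.automation_ran_in_window,
     "Automation is a suspect action; find exactly what it changed."),
    (lambda o: len(o.category_claims) > 0,
     "Category claims stay hypotheses, not root causes."),
]


def constraints_for(obs: Observations) -> list[str]:
    """Return every constraint whose condition matches the observations."""
    return [text for condition, text in RULES if condition(obs)]


obs = Observations(
    affected_endpoints=1,
    service_responding=True,
    slow=False,
    failures_track_av_scan_peak=False,
    automation_ran_in_window=True,
    category_claims=("it's AV",),
)
for constraint in constraints_for(obs):
    print("-", constraint)
```

Keeping the rules in a list makes it cheap to add a new one after a post-incident review without touching the evaluation logic.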
boxed_problem.template
Observed Symptom
One sentence, observable condition — not an assumption
Scope & Impact
Endpoint count · service layer · A / I / C classification
First Known Good
Timestamp · source (log, monitoring, job history, user report)
First Known Bad
Timestamp · source · how confirmed
Frequency & Repro
Constant / intermittent · trigger · safe repro method if known
Out of Scope
What the failure behavior explicitly rules out
Allowed Hypotheses
Only theories that fit the observed behavior. Nothing else makes the list yet.
Next Evidence
Specific logs or tests that will shrink the box to one mechanism
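For teams that keep investigation notes in code or tickets generated from code, the template above could be represented as a small record type. This is a minimal sketch under that assumption: the class name, field names, and sample values are illustrative and mirror the template headings; they are not taken from any existing tool.

```python
from dataclasses import dataclass, field, fields


@dataclass
class BoxedProblem:
    observed_symptom: str = ""        # one sentence, observable condition
    scope_and_impact: str = ""        # endpoint count, service layer, A / I / C
    first_known_good: str = ""        # timestamp and source
    first_known_bad: str = ""         # timestamp, source, how confirmed
    frequency_and_repro: str = ""     # constant or intermittent, trigger, safe repro
    out_of_scope: list[str] = field(default_factory=list)
    allowed_hypotheses: list[str] = field(default_factory=list)
    next_evidence: list[str] = field(default_factory=list)

    def missing(self) -> list[str]:
        """Return the names of template fields that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Illustrative values only.
box = BoxedProblem(
    observed_symptom="Nightly backup job on host-a exits with an error",
    first_known_bad="2024-01-10T02:05 (job history)",
)
print("Still to fill in:", ", ".join(box.missing()))
```

The missing() check is a prompt, not a gate: an empty First Known Good, for example, signals that the next step is to derive it from monitoring or job history before proposing hypotheses.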