A note to the people who own the outcome
This is for the business owners, delivery leads, implementation partners, architects, and engineering managers behind ecommerce solutions.
As commerce ecosystems grow, they naturally collect more storefront logic, integrations, ERP dependencies, custom extensions, infrastructure choices, configuration, third-party services, and data patterns. A single symptom can appear in one layer while originating somewhere else entirely.
Structured Root Cause Analysis (RCA) helps teams manage that complexity. It gives everyone a shared way to move from symptom to evidence, from evidence to cause, and from cause to durable resolution.
So consider this the opening article in a series.
The goal is simple: help teams reach the real cause faster, make better technical decisions under pressure, and improve delivery quality across the commerce solution.
The mindset that resolves issues faster
In complex commerce environments, symptoms often point teams toward the wrong system first. A slow checkout might originate in infrastructure, integrations, configuration, data, customizations, platform behavior, or the interaction between several of those layers.
The useful question is which system behavior produced the outcome we are seeing, and what evidence will prove it.
A strong RCA process keeps the team focused on three practical questions:
- What can we stabilize right now?
- What evidence do we need to understand the behavior?
- What change will prevent the issue from recurring?
This is the posture that shortens resolution cycles. It treats RCA as an engineering and delivery discipline, not as a reactive checklist.
Prepare before the pressure is on
Modern commerce teams usually have enough tools to investigate well: telemetry, logs, profilers, distributed tracing, network captures, database diagnostics, deployment history, and environment configuration. The gap is rarely tool availability alone. The gap is whether the team knows how to use those tools together.
In calm weather, every delivery team should know the basics:
- How the solution is wired across storefront, platform, integrations, infrastructure, and data.
- Where telemetry, logs, errors, traces, and deployment history live.
- Who can access the right environments and whether a production-like copy exists for investigation.
- How to reproduce a behavior cleanly and observe it with reliable metrics.
Preparation is not overhead. It is what turns a production issue from a long guessing cycle into a short investigation.
Identify the system boundary
Effective RCA starts by understanding which layer of the solution is involved. In most commerce environments, behavior can originate from platform capabilities, custom extensions, integrations, configuration, infrastructure, third-party services, or data.
For Virto Commerce projects, the distinction matters because the platform is intentionally extensible. The delivered solution includes Virto Commerce plus project-specific modules, configuration, integrations, deployment topology, and operational practices.
The objective is to identify the boundary objectively:
- Reproduce it on clean Virto Commerce. This is the single most powerful move. Take the same scenario, with the same request and the same data shape, and run it on a vanilla install (start-local gives you the full stack on your machine in one command). If it doesn’t reproduce on clean Virto, the cause is very likely in your customizations or data, not the platform.
- Read the stack trace. If the exception originates in YourCompany.CustomModule.*, that’s a strong signal. If it’s deep in a platform call but only under your specific data, suspect data or configuration.
- Change one variable. Same code, same queries, more CPU headroom → problem disappears? Then it was resource saturation, not a code defect (more on this below).
- Check what changed. A module upgrade, a config change, a data import, a new integration — incidents usually have a trigger. Find the diff.
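The stack-trace check above can be partly automated. The following is a minimal sketch, not a Virto Commerce tool: it assumes .NET-style frames ("at Namespace.Type.Method(...)"), and the namespace prefixes are placeholders you would replace with your project's real module namespaces.

```python
import re

# Hypothetical namespace prefixes; adjust to the project's real module names.
CUSTOM_PREFIXES = ("YourCompany.",)
PLATFORM_PREFIXES = ("VirtoCommerce.",)

# Matches the "at Namespace.Type.Method(" part of a .NET stack frame.
FRAME_RE = re.compile(r"^\s*at\s+([\w.`<>+\[\],]+)\(", re.MULTILINE)


def classify_frames(stack_trace: str) -> list[tuple[str, str]]:
    """Label every frame as 'custom', 'platform', or 'other', top frame first."""
    labelled = []
    for method in FRAME_RE.findall(stack_trace):
        if method.startswith(CUSTOM_PREFIXES):
            layer = "custom"
        elif method.startswith(PLATFORM_PREFIXES):
            layer = "platform"
        else:
            layer = "other"
        labelled.append((layer, method))
    return labelled


def first_owned_layer(stack_trace: str) -> str:
    """Return the layer of the topmost frame that is custom or platform code."""
    for layer, _ in classify_frames(stack_trace):
        if layer != "other":
            return layer
    return "unknown"
```

Treat the result as a hint about where to look first, not a verdict: a platform frame at the top can still be triggered by project-specific data or configuration.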
The faster the team identifies the relevant layer, the faster it can choose the right next action.
Collecting the right information is 90% of the solution
I’ll say that again because it’s the most important sentence in this article: gathering the right information correctly is about 90% of solving the problem. A clean diagnosis almost always resolves itself once the data is on the table. A messy one drags on for weeks regardless of how smart the people are.
Here’s what “the right information” looks like.
1. Trust your data — and the right tools
Use the real instruments, and learn to read them correctly:
- Azure Monitor / Application Insights — request and dependency durations, exceptions, performance counters.
- Error details and dependency calls — the actual exception, the actual SQL, the actual downstream call.
- CPU, memory, thread pool, and GC metrics — always together, never in isolation.
One critical caveat, because it bites everyone eventually: when CPU is above ~85–90%, Application Insights durations become misleading. Under CPU saturation the thread pool can’t schedule work fast enough, so Application Insights reports wall-clock time (execution plus scheduling delay), not real execution time. A dependency that looks like it took 120 ms may have spent 110 ms waiting for CPU and 10 ms doing actual work. The database didn’t get slow; the host ran out of CPU. I wrote this up in detail here, and it’s required reading before you interpret any duration chart:
Do Not Blindly Trust Application Insights Durations When CPU Is Overloaded
The rule that follows: scale first, optimize second. Stabilize CPU headroom so your telemetry becomes trustworthy again, then profile and optimize.
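One way to apply the caveat above is to separate duration samples by the CPU level recorded in the same time window before reading anything into them. This is a sketch under stated assumptions: the Sample fields are illustrative rather than an Application Insights schema, and the 85% threshold is simply the lower end of the range mentioned above.

```python
from dataclasses import dataclass

# Assumed threshold, taken from the ~85-90% range mentioned above.
CPU_SATURATION_PERCENT = 85.0


@dataclass(frozen=True)
class Sample:
    """One telemetry point; the field names are illustrative."""
    timestamp: str
    duration_ms: float
    cpu_percent: float


def split_by_cpu_headroom(samples, threshold=CPU_SATURATION_PERCENT):
    """Separate durations measured with CPU headroom from those measured under saturation."""
    trusted = [s for s in samples if s.cpu_percent < threshold]
    suspect = [s for s in samples if s.cpu_percent >= threshold]
    return trusted, suspect
```

Durations in the suspect group are not necessarily wrong, but they include scheduling delay, so they should not be used as evidence against a dependency until the same scenario has been measured with headroom.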
2. Versions — what changed
Record the platform version and the version of every module, and note what changed recently. “It worked last week” is a clue; “we upgraded module X and changed appsettings on Tuesday” is half the answer.
Virto Commerce has native tools for exporting the current version of the installed platform and modules.
- Click on platform version
- Click either Download manifest, Download package or Copy
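Once you have two version snapshots (say, last week's and today's), finding the diff is mechanical. The sketch below assumes each snapshot has already been reduced to a plain mapping of module id to version string; the real manifest layout may differ, and the module ids in the usage example are made up.

```python
def diff_versions(before: dict[str, str], after: dict[str, str]) -> dict[str, dict]:
    """Compare two {module_id: version} snapshots and report what changed."""
    added = {m: v for m, v in after.items() if m not in before}
    removed = {m: v for m, v in before.items() if m not in after}
    changed = {
        m: (before[m], after[m])
        for m in before.keys() & after.keys()
        if before[m] != after[m]
    }
    return {"added": added, "removed": removed, "changed": changed}


if __name__ == "__main__":
    # Hypothetical module ids and versions, for illustration only.
    last_week = {"Example.Catalog": "3.800.0", "Example.Cart": "3.800.0"}
    today = {"Example.Catalog": "3.801.0", "Example.Cart": "3.800.0", "Example.Custom": "1.0.0"}
    print(diff_versions(last_week, today))
```

Attaching this kind of diff to the report answers "what changed" without anyone having to compare two lists by eye.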
3. Reproduction and expected result
Write down: how to reproduce the problem, and what the expected result is. A request that “is slow” is not a report. “POST to graphql/AddOrUpdateCart takes 867 ms p95 under N concurrent users, we expect <150 ms” is a report someone can act on.
Application Insights can export data in several formats. My favourite is Copy data, which exports the end-to-end transaction as a JSON document.
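To turn raw timings into a statement like the p95 example above, a small helper is enough. This sketch uses the nearest-rank percentile definition (other tools may interpolate and report slightly different numbers) and assumes you already have the request durations in milliseconds as a plain list.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(durations_ms, target_ms, pct=95):
    """Build the 'actual vs expected' part of a report from raw durations."""
    observed = percentile(durations_ms, pct)
    return {
        "percentile": pct,
        "observed_ms": observed,
        "target_ms": target_ms,
        "meets_target": observed < target_ms,
    }
```

State the sample size and the load level next to the number; a p95 over twenty requests and a p95 over twenty thousand are not the same kind of evidence.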
4. Capture the live session, not a screenshot
When the problem is “a page or request is slow,” the most useful artifact you can hand us is the actual network session — not a photo of the timings.
A HAR file (HTTP Archive) is the de-facto standard for recording a full web session: every request and response, headers, payloads, and precise timings, all in one file your browser produces natively. It’s replayable, measurable, and removes the guesswork — which is exactly why providing one to support tends to expedite resolution dramatically.
We wrote up the how-to here:
Capture Web Session Traffic
It works in Chrome, Edge, or Firefox — Chrome DevTools is marginally the simplest:
- Right-click the page → Inspect.
- Open the Network tab.
- Reproduce the slow action, then click the download / Export HAR button (the tooltip reads “Export HAR”). Choosing “Save all as HAR with content” includes the request/response bodies.
- Name the file and Save — all requests on the page are captured into that one file.
A few habits make the capture far more valuable:
- Record only the reproduction. Start the capture, perform only the slow action, then export. A focused HAR beats a noisy one full of unrelated traffic.
- Mind the secrets. A HAR contains headers, cookies, and tokens — capture from a test account where you can, and treat the file as sensitive when sharing.
- Add the context. Note the timestamp, the environment, what you did, and the expected-vs-actual result, so the timings in the HAR map to a story.
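Because a HAR file is plain JSON, the habits above are easy to support with a few lines of code. The sketch below relies only on the standard HAR layout (log.entries, each with time, request, and response); the list of sensitive headers is a starting assumption, and it does not catch tokens embedded in URLs or bodies, so still treat the output as sensitive.

```python
import copy
import json

# Assumed minimum set of credential-bearing headers; extend for your environment.
SENSITIVE_HEADERS = {"authorization", "cookie", "set-cookie"}


def load_har(path: str) -> dict:
    """Read a HAR file from disk (utf-8-sig also tolerates a leading BOM)."""
    with open(path, encoding="utf-8-sig") as handle:
        return json.load(handle)


def slowest_entries(har: dict, top: int = 5) -> list[tuple[float, str, str]]:
    """Return (time_ms, method, url) for the slowest requests in a HAR document."""
    ranked = sorted(har["log"]["entries"], key=lambda e: e["time"], reverse=True)
    return [(e["time"], e["request"]["method"], e["request"]["url"]) for e in ranked[:top]]


def redact(har: dict) -> dict:
    """Return a copy with common credential headers and cookie lists blanked out."""
    cleaned = copy.deepcopy(har)
    for entry in cleaned["log"]["entries"]:
        for side in ("request", "response"):
            message = entry.get(side, {})
            for header in message.get("headers", []):
                if header["name"].lower() in SENSITIVE_HEADERS:
                    header["value"] = "REDACTED"
            if "cookies" in message:
                message["cookies"] = []
    return cleaned
```

Running slowest_entries first tells you whether the capture actually contains the slow request; running redact before sharing reduces, but does not eliminate, the risk of leaking credentials.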
5. Let the AI assistants do the heavy lifting
This is genuinely changing how fast we diagnose issues, so use it. Virto Commerce now ships AI assistance across the platform, and our documentation is available to AI coding assistants directly:
- Virto OZ provides context-aware assistance and a developer copilot across the platform — it can locate, aggregate, and summarize platform data and support technical investigation in plain language. See Getting Started → AI Assistance in the Platform Dev Docs.
- Our docs are in Context7, so if you use Claude Code, Cursor, or any MCP-compatible assistant, you can have it reason over the current Virto Commerce documentation instead of relying on stale model memory: Virto Commerce Docs Are Now Available in Context7.
An AI assistant that can read your stack trace and the current docs will often point you at the right module in minutes. Use it as your first responder.
If you’re stuck, hand the problem over the right way
Sometimes you genuinely can’t tell where the issue lives, and you bring it to Virto Team. Wonderful — that’s what we’re here for. But how you hand it over determines how fast we can help. Context transfer is everything.
I’m attaching some real anonymized examples of how problems reach us, because the contrast is instructive. (Names removed — this is about the pattern, not the people. Every one of us has done the “wrong” version at some point, myself included.)
How not to hand over a problem
- A screenshot of a data grid instead of the data. A photo of an end-to-end transaction table tells me a request was slow. The exported transaction data tells me why. When the export exists and we get a screenshot of it, we’ve lost the most useful artifact.
- “The database is slow” with no evidence — or with evidence that actually points elsewhere. I’ve seen reports conclude “SQL is the suspect” and “maybe Elastic too,” sitting right next to a CPU-at-100% chart. As we just covered, at 100% CPU the durations are lying to you. The data was there; it was read backwards.
- Cause and effect reversed. “A simple entity update took 2 minutes, so the database is broken” — when the entity update was slow because the application was paused waiting for CPU. The symptom got promoted to a cause.
- A link only one team can open. Pasting an Azure portal link that support can’t access (“the link content is inaccessible to me, but maybe you have access”) moves the work back to us to even see the problem.
I’ll be candid about why this matters beyond convenience: when a handover is mostly screenshots and a conclusion, it often reads — fairly or not — as closing the ticket on our side rather than solving the problem. Even when that’s not the intent, it has the same effect: it slows everyone down. The fix isn’t more effort, it’s the right artifacts.
How to hand over a problem
- Export the data, don’t photograph it. The actual transaction details file, the JSON, the query results. If it exists as data, send the data.
- The exact reproduction. The exact request (or a request that reproduces it), inputs, and the environment it happened on.
- The full exception — message and stack trace as text, not a cropped screenshot.
- The metrics in context — CPU/memory alongside the durations, for the same time window, so nobody reads a saturated host as a slow database.
- Versions and recent changes.
- Access that works for us, or the content extracted and shared directly.
A few extra suggestions while we’re here: give us a non-prod environment we can poke at rather than only screenshots of prod; state your expected vs. actual explicitly; and if you have a hypothesis, tell us what you already ruled out and how. That turns a guessing game into a collaboration.
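If your team files many tickets, the list above can be turned into a pre-flight check. This is only an illustration: the keys and labels are invented for this sketch and do not correspond to any Virto Commerce support form.

```python
# Hypothetical artifact keys mirroring the handover list above.
REQUIRED_ARTIFACTS = {
    "exported_data": "Exported data (transaction details, JSON, query results)",
    "reproduction": "Exact reproduction (request, inputs, environment)",
    "exception_text": "Full exception message and stack trace as text",
    "metrics_window": "CPU/memory and durations for the same time window",
    "versions_and_changes": "Versions and recent changes",
    "access": "Working access or extracted content",
}


def missing_artifacts(bundle: dict[str, str]) -> list[str]:
    """List the handover items that are absent or left blank in a ticket draft."""
    return [
        label
        for key, label in REQUIRED_ARTIFACTS.items()
        if not bundle.get(key, "").strip()
    ]
```

An empty result does not guarantee a good report, but a non-empty one is a reliable sign that the first reply will be a request for more information.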
Meet us with an active stance
When you follow Virto Team’s recommendations, we move fast. We’ll get on a call, we’ll dig into it together, we’ll reproduce on clean Virto if needed. But this is a two-way street, and I want to be honest about the friction I sometimes see.
On a genuine P1, the rhythm too often goes: we send a recommendation or offer a call slot within the hour — and the reply comes two days later. On a P1, a two-day round trip isn’t a process, it’s an outage extended by latency. We can bring urgency, expertise, and tooling. What we can’t bring is your active engagement on your own production system. Meet us halfway and these incidents close in hours, not weeks.
Know where it is? Contribute the fix instead of waiting
Here’s a genuinely empowering option that partners underuse: if your investigation lands on something in the platform — a bug, a missing extensibility point, a small improvement — you don’t have to wait for us to schedule it. Virto Commerce is open source. Branch from dev, open a PR, sign the CLA on your first contribution, and every PR builds an Alpha release you can test before merge.
Contributing the fix is frequently faster than waiting in a queue, it gets the improvement to the whole ecosystem, and — selfishly for you — it means the next platform upgrade already contains your fix instead of re-breaking your patch. Here’s how:
How to Contribute to Virto Commerce
Production is a daily practice, not a one-time launch
The last shift I’ll ask for is the most strategic. Running in production is a daily discipline, not a milestone you pass at GoLive. Your project grows — new clients, new features, more data — and your technical team should see and understand what’s happening: the trends, what’s approaching a limit, what will need a change or an improvement and roughly when.
Virto Commerce’s architecture forgives a lot and simplifies a lot — but it can only help a team that’s watching. And please don’t forget to plan upgrades to stable Virto Commerce platform versions as part of that practice. We are improving the platform constantly — performance, scalability, tooling, AI assistance — and staying current means many problems get fixed for you before you ever hit them.
Bottom line
Commerce ecosystems naturally become more complex as businesses grow. RCA is one of the disciplines that helps teams manage that complexity in a structured way.
The strongest teams do not wait for symptoms to become urgent before learning how their systems behave. They prepare the evidence path, understand their architecture, keep telemetry readable, and collaborate from shared facts.
That is how complex commerce issues get resolved faster, and how the delivery process gets stronger each time.








