How to Run a Real Root Cause Analysis on a Virto Commerce Project (and Solve Problems Faster, Together)

A note to the people who own the outcome

This is for the business owners, delivery leads, implementation partners, architects, and engineering managers behind ecommerce solutions.

As commerce ecosystems grow, they naturally collect more storefront logic, integrations, ERP dependencies, custom extensions, infrastructure choices, configuration, third-party services, and data patterns. A single symptom can appear in one layer while originating somewhere else entirely.

Structured Root Cause Analysis (RCA) helps teams manage that complexity. It gives everyone a shared way to move from symptom to evidence, from evidence to cause, and from cause to durable resolution.

So consider this the opening article in a series.

The goal is simple: help teams reach the real cause faster, make better technical decisions under pressure, and improve delivery quality across the commerce solution.


The mindset that resolves issues faster

In complex commerce environments, symptoms often point teams toward the wrong system first. A slow checkout might originate in infrastructure, integrations, configuration, data, customizations, platform behavior, or the interaction between several of those layers.

The useful question is which system behavior produced the outcome we are seeing, and what evidence will prove it.

A strong RCA process keeps the team focused on three practical questions:

  • What can we stabilize right now?
  • What evidence do we need to understand the behavior?
  • What change will prevent the issue from recurring?

This is the posture that shortens resolution cycles. It treats RCA as an engineering and delivery discipline, not as a reactive checklist.


Prepare before the pressure is on

Modern commerce teams usually have enough tools to investigate well: telemetry, logs, profilers, distributed tracing, network captures, database diagnostics, deployment history, and environment configuration. The gap is rarely tool availability alone. The gap is whether the team knows how to use those tools together.

In calm weather, every delivery team should know the basics:

  • How the solution is wired across storefront, platform, integrations, infrastructure, and data.
  • Where telemetry, logs, errors, traces, and deployment history live.
  • Who can access the right environments and whether a production-like copy exists for investigation.
  • How to reproduce a behavior cleanly and observe it with reliable metrics.

Preparation is not overhead. It is what turns a production issue from a long guessing cycle into a short investigation.


Identify the system boundary

Effective RCA starts by understanding which layer of the solution is involved. In most commerce environments, behavior can originate from platform capabilities, custom extensions, integrations, configuration, infrastructure, third-party services, or data.

For Virto Commerce projects, the distinction matters because the platform is intentionally extensible. The delivered solution includes Virto Commerce plus project-specific modules, configuration, integrations, deployment topology, and operational practices.

The objective is to identify the boundary objectively:

  • Reproduce it on clean Virto Commerce. This is the single most powerful move. Take the same scenario — same request, same data shape — and run it on a vanilla install (start-local gives you the full stack on your machine in one command). If it doesn’t reproduce on clean Virto, the cause is very likely in your customizations or data, not the platform.
  • Read the stack trace. If the exception originates in YourCompany.CustomModule.*, that’s a strong signal. If it’s deep in a platform call but only under your specific data, suspect data or configuration.
  • Change one variable. Same code, same queries, more CPU headroom → problem disappears? Then it was resource saturation, not a code defect (more on this below).
  • Check what changed. A module upgrade, a config change, a data import, a new integration — incidents usually have a trigger. Find the diff.
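
The stack-trace check above can be roughed out in a few lines. This is a minimal sketch, assuming .NET-style frames of the form "at Namespace.Type.Method(...)". The prefixes `YourCompany.` and `VirtoCommerce.` and the function names are illustrative assumptions, so swap in the namespaces your project actually uses.

```python
import re

# Matches .NET stack frames such as "   at Some.Namespace.Type.Method(args)".
FRAME = re.compile(r"^\s*at\s+([\w.`<>]+)\(", re.MULTILINE)

def classify_frames(stack_trace, custom_prefixes=("YourCompany.",),
                    platform_prefixes=("VirtoCommerce.",)):
    """Label each frame as 'custom', 'platform', or 'other', top frame first."""
    labels = []
    for name in FRAME.findall(stack_trace):
        if name.startswith(tuple(custom_prefixes)):
            labels.append(("custom", name))
        elif name.startswith(tuple(platform_prefixes)):
            labels.append(("platform", name))
        else:
            labels.append(("other", name))  # framework or third-party code
    return labels

def likely_layer(stack_trace, **kwargs):
    """Return the label of the topmost custom or platform frame, else 'other'."""
    for label, _ in classify_frames(stack_trace, **kwargs):
        if label != "other":
            return label
    return "other"
```

A "custom" result is only a first signal about where to look. A "platform" result still needs the data and configuration checks described above before anyone draws a conclusion.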

The faster the team identifies the relevant layer, the faster it can choose the right next action.


Collecting the right information is 90% of the solution

I’ll say that again because it’s the most important sentence in this article: gathering the right information correctly is about 90% of solving the problem. A clean diagnosis almost always resolves itself once the data is on the table. A messy one drags on for weeks regardless of how smart the people are.

Here’s what “the right information” looks like.

1. Trust your data — and the right tools

Use the real instruments, and learn to read them correctly:

  • Azure Monitor / Application Insights — request and dependency durations, exceptions, performance counters.
  • Error details and dependency calls — the actual exception, the actual SQL, the actual downstream call.
  • CPU, memory, thread pool, and GC metrics — always together, never in isolation.

One critical caveat, because it bites everyone eventually: when CPU is above ~85–90%, Application Insights durations become misleading. Under CPU saturation the thread pool can’t schedule work fast enough, so Application Insights reports wall-clock time (execution plus scheduling delay), not real execution time. A dependency that looks like it took 120 ms may have spent 110 ms waiting for CPU and 10 ms doing actual work. The database didn’t get slow; the host ran out of CPU. I wrote this up in detail here, and it’s required reading before you interpret any duration chart:

:backhand_index_pointing_right: Do Not Blindly Trust Application Insights Durations When CPU Is Overloaded

The rule that follows: scale first, optimize second. Stabilize CPU headroom so your telemetry becomes trustworthy again, then profile and optimize.
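
To make the caveat concrete, here is a small sketch that separates duration samples taken while the host had CPU headroom from those taken under saturation. The `Sample` shape and the 85% default threshold are assumptions for illustration (the threshold mirrors the ~85–90% range above). In practice you would fill the samples from your own exported CPU and duration metrics for the same time window.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    minute: str          # time bucket, e.g. "12:01"
    cpu_percent: float   # host CPU for that bucket
    duration_ms: float   # duration reported by telemetry for that bucket

def split_by_cpu(samples, cpu_threshold=85.0):
    """Separate trustworthy samples from those recorded under CPU saturation."""
    trusted = [s for s in samples if s.cpu_percent < cpu_threshold]
    suspect = [s for s in samples if s.cpu_percent >= cpu_threshold]
    return trusted, suspect

samples = [
    Sample("12:00", 40.0, 12.0),
    Sample("12:01", 97.0, 120.0),  # likely inflated by scheduling delay
]
trusted, suspect = split_by_cpu(samples)
```

Only the `trusted` bucket is worth reading as execution time. The `suspect` bucket says the host needs headroom before the durations mean anything.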

2. Versions — what changed

Record the platform version and the version of every module, and note what changed recently. “It worked last week” is a clue; “we upgraded module X and changed appsettings on Tuesday” is half the answer.

:writing_hand:
Virto Commerce has native tools for exporting the current version of the installed platform and modules.

  1. Click on platform version
  2. Click either Download manifest, Download package or Copy
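
Once you have two version snapshots (say, last week and today), finding the diff is mechanical. The sketch below assumes you have already reduced each snapshot to a plain mapping of module id to version string. That shape, the module ids, and the version numbers are made up for illustration and are not the actual manifest format.

```python
def diff_versions(before, after):
    """Compare two {module_id: version} snapshots and report what changed."""
    changes = {}
    for module in sorted(set(before) | set(after)):
        old, new = before.get(module), after.get(module)
        if old != new:
            changes[module] = (old, new)  # None means the module was absent
    return changes

# Hypothetical snapshots for illustration only.
last_week = {"VirtoCommerce.Cart": "3.800.0", "YourCompany.CustomModule": "1.4.0"}
today = {"VirtoCommerce.Cart": "3.801.0", "YourCompany.CustomModule": "1.4.0"}

for module, (old, new) in diff_versions(last_week, today).items():
    print(f"{module}: {old} -> {new}")
```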

3. Reproduction and expected result

Write down: how to reproduce the problem, and what the expected result is. A request that “is slow” is not a report. “POST to graphql/AddOrUpdateCart takes 867 ms p95 under N concurrent users, we expect <150 ms” is a report someone can act on.
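
If you have raw durations rather than a ready-made percentile, a few lines turn them into that kind of report. This is a sketch using the nearest-rank definition of a percentile. Your monitoring tool may interpolate and give a slightly different number, and the function names here are my own.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def report_line(name, durations_ms, expected_ms, pct=95):
    """Format an expected-vs-actual statement someone can act on."""
    actual = percentile(durations_ms, pct)
    verdict = "within" if actual <= expected_ms else "exceeds"
    return (f"{name}: p{pct} = {actual:.0f} ms over {len(durations_ms)} requests, "
            f"{verdict} expected {expected_ms} ms")
```

Calling `report_line` with the request name, the measured durations, and the target gives one sentence that states both the actual and the expected result.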

:writing_hand:
Application Insights can export data in several formats. My favourite is Copy data, which exports the end-to-end transaction as a JSON document.

4. Capture the live session, not a screenshot

When the problem is “a page or request is slow,” the most useful artifact you can hand us is the actual network session — not a photo of the timings.

A HAR file (HTTP Archive) is the de-facto standard for recording a full web session: every request and response, headers, payloads, and precise timings, all in one file your browser produces natively. It’s replayable, measurable, and removes the guesswork — which is exactly why providing one to support tends to expedite resolution dramatically.

We wrote up the how-to here: :backhand_index_pointing_right: Capture Web Session Traffic

:writing_hand:
It works in Chrome, Edge, or Firefox — Chrome DevTools is marginally the simplest:

  1. Right-click the page → Inspect.
  2. Open the Network tab.
  3. Reproduce the slow action, then click the download / Export HAR button (the tooltip reads “Export HAR”). Choosing “Save all as HAR with content” includes the request/response bodies.
  4. Name the file and Save — all requests on the page are captured into that one file.

A few habits make the capture far more valuable:

  • Record only the reproduction. Start the capture, perform only the slow action, then export. A focused HAR beats a noisy one full of unrelated traffic.
  • Mind the secrets. A HAR contains headers, cookies, and tokens — capture from a test account where you can, and treat the file as sensitive when sharing.
  • Add the context. Note the timestamp, the environment, what you did, and the expected-vs-actual result, so the timings in the HAR map to a story.
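
A HAR file is plain JSON, so you can take a first look yourself before sharing it. The sketch below relies on the standard HAR fields (`log.entries`, `time`, `timings.wait`, `request.headers`, `cookies`). The function names and the header list are my own, and the redaction is deliberately basic: it does not catch tokens embedded in URLs or request bodies, so still treat the file as sensitive.

```python
import json

SENSITIVE_HEADERS = {"authorization", "cookie", "set-cookie"}

def load_har(path):
    """Read a HAR file exported from the browser (tolerates a UTF-8 BOM)."""
    with open(path, encoding="utf-8-sig") as f:
        return json.load(f)

def slowest_requests(har, top=5):
    """Return (total_ms, server_wait_ms, method, url) for the slowest entries."""
    rows = []
    for entry in har["log"]["entries"]:
        request = entry["request"]
        # In HAR, a timing of -1 means "not available".
        wait = entry.get("timings", {}).get("wait", -1)
        rows.append((entry["time"], wait, request["method"], request["url"]))
    return sorted(rows, key=lambda row: row[0], reverse=True)[:top]

def redact(har):
    """Blank out common credential-bearing headers and all cookie values, in place."""
    for entry in har["log"]["entries"]:
        for part in (entry["request"], entry["response"]):
            for header in part.get("headers", []):
                if header["name"].lower() in SENSITIVE_HEADERS:
                    header["value"] = "REDACTED"
            for cookie in part.get("cookies", []):
                cookie["value"] = "REDACTED"
    return har
```

A large `wait` relative to `time` suggests the time went into the server producing the response, while a small one points at the network or the browser. Either way, that is a starting point for the story you attach to the file.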

5. Let the AI assistants do the heavy lifting

This is genuinely changing how fast we diagnose issues, so use it. Virto Commerce now ships AI assistance across the platform, and our documentation is available to AI coding assistants directly:

  • Virto OZ provides context-aware assistance and a developer copilot across the platform — it can locate, aggregate, and summarize platform data and support technical investigation in plain language. See Getting Started → AI Assistance in the Platform Dev Docs.
  • Our docs are in Context7, so if you use Claude Code, Cursor, or any MCP-compatible assistant, you can have it reason over the current Virto Commerce documentation instead of relying on stale model memory: Virto Commerce Docs Are Now Available in Context7.

An AI assistant that can read your stack trace and the current docs will often point you at the right module in minutes. Use it as your first responder.


If you’re stuck, hand the problem over the right way

Sometimes you genuinely can’t tell where the issue lives, and you bring it to Virto Team. Wonderful — that’s what we’re here for. But how you hand it over determines how fast we can help. Context transfer is everything.

I’m attaching some real anonymized examples of how problems reach us, because the contrast is instructive. (Names removed — this is about the pattern, not the people. Every one of us has done the “wrong” version at some point, myself included.)

:cross_mark: How not to hand over a problem

  • A screenshot of a data grid instead of the data. A photo of an end-to-end transaction table tells me a request was slow. The exported transaction data tells me why. When the export exists and we get a screenshot of it, we’ve lost the most useful artifact.
  • “The database is slow” with no evidence — or with evidence that actually points elsewhere. I’ve seen reports conclude “SQL is the suspect” and “maybe Elastic too,” sitting right next to a CPU-at-100% chart. As we just covered, at 100% CPU the durations are lying to you. The data was there; it was read backwards.
  • Cause and effect reversed. “A simple entity update took 2 minutes, so the database is broken” — when the entity update was slow because the application was paused waiting for CPU. The symptom got promoted to a cause.
  • A link only one team can open. Pasting an Azure portal link that support can’t access (“the link content is inaccessible to me, but maybe you have access”) moves the work back to us to even see the problem.

I’ll be candid about why this matters beyond convenience: when a handover is mostly screenshots and a conclusion, it often reads — fairly or not — as closing the ticket on our side rather than solving the problem. Even when that’s not the intent, it has the same effect: it slows everyone down. The fix isn’t more effort, it’s the right artifacts.

:white_check_mark: How to hand over a problem

  • Export the data, don’t photograph it. The actual transaction details file, the JSON, the query results. If it exists as data, send the data.
  • The exact reproduction. The exact request (or a request that reproduces it), inputs, and the environment it happened on.
  • The full exception — message and stack trace as text, not a cropped screenshot.
  • The metrics in context — CPU/memory alongside the durations, for the same time window, so nobody reads a saturated host as a slow database.
  • Versions and recent changes.
  • Access that works for us, or the content extracted and shared directly.

A few extra suggestions while we’re here: give us a non-prod environment we can poke at rather than only screenshots of prod; state your expected vs. actual explicitly; and if you have a hypothesis, tell us what you already ruled out and how. That turns a guessing game into a collaboration.


Meet us with an active stance

When you follow Virto Team’s recommendations, we move fast. We’ll get on a call, we’ll dig into it together, we’ll reproduce on clean Virto if needed. But this is a two-way street, and I want to be honest about the friction I sometimes see.

On a genuine P1, the rhythm too often goes: we send a recommendation or offer a call slot within the hour — and the reply comes two days later. On a P1, a two-day round trip isn’t a process, it’s an outage extended by latency. We can bring urgency, expertise, and tooling. What we can’t bring is your active engagement on your own production system. Meet us halfway and these incidents close in hours, not weeks.


Know where it is? Contribute the fix instead of waiting

Here’s a genuinely empowering option that partners underuse: if your investigation lands on something in the platform — a bug, a missing extensibility point, a small improvement — you don’t have to wait for us to schedule it. Virto Commerce is open source. Branch from dev, open a PR, sign the CLA on your first contribution, and every PR builds an Alpha release you can test before merge.

Contributing the fix is frequently faster than waiting in a queue, it gets the improvement to the whole ecosystem, and — selfishly for you — it means the next platform upgrade already contains your fix instead of re-breaking your patch. Here’s how: :backhand_index_pointing_right: How to Contribute to Virto Commerce


Production is a daily practice, not a one-time launch

The last shift I’ll ask for is the most strategic. Running in production is a daily discipline, not a milestone you pass at GoLive. Your project grows — new clients, new features, more data — and your technical team should see and understand what’s happening: the trends, what’s approaching a limit, what will need a change or an improvement and roughly when.

Virto Commerce’s architecture forgives a lot and simplifies a lot — but it can only help a team that’s watching. And please don’t forget to plan upgrades to stable Virto Commerce platform versions as part of that practice. We are improving the platform constantly — performance, scalability, tooling, AI assistance — and staying current means many problems get fixed for you before you ever hit them.


Bottom line

Commerce ecosystems naturally become more complex as businesses grow. RCA is one of the disciplines that helps teams manage that complexity in a structured way.

The strongest teams do not wait for symptoms to become urgent before learning how their systems behave. They prepare the evidence path, understand their architecture, keep telemetry readable, and collaborate from shared facts.

That is how complex commerce issues get resolved faster, and how the delivery process gets stronger each time.

A field guide to common problems (and how to think about each)

Most production issues fall into a handful of buckets. Recognizing the bucket early saves you the most time.

1. High CPU — often just “the project grew up”

The most common one, and frequently the least alarming. Your project succeeded: more clients, more catalog, more features, more traffic. The system simply needs more resources. Don’t start by hunting for a code villain. Start by scaling and isolating:

  • Scale first to restore headroom and trustworthy metrics.
  • Isolate workloads — segment the frontend onto its own instances, separate from background jobs and indexation, so administrative tasks never impact customers. Virto Commerce’s scalability options are built for exactly this.
  • Then keep analyzing on a copy of production to find whether there’s also a hot path worth optimizing.

(Real example from our recent work: a recurring full-reindex job firing every 15 minutes drove a production host to 100% CPU. The fix was to disable the brute-force job, switch to event-based indexation, and reserve frequent rebuilds for non-prod — but the durable answer was also to scale and segment, and to investigate the rebuild cost on a non-prod copy before re-enabling anything.)

2. Slow database queries

Real query slowness exists — but confirm CPU is healthy first, because saturation masquerades as DB slowness constantly. Once CPU is ruled out: capture the actual SQL and its duration, check the execution plan, look for missing indexes or N+1 patterns introduced by custom code, and verify whether non-prod resources (which may “sleep” when idle) simply need a warm-up that prod would never need.
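
For the N+1 check specifically, one rough way to spot the pattern is to take the SQL statements recorded for a single request and count how many share the same shape once literal values are stripped. This is a sketch, not a SQL parser: the regexes are a crude normalization, the `min_repeats` threshold is arbitrary, and the table name in the example is hypothetical.

```python
import re
from collections import Counter

def normalize_sql(sql):
    """Replace literals so that queries differing only in values compare equal."""
    sql = re.sub(r"'[^']*'", "?", sql)   # string literals
    sql = re.sub(r"\b\d+\b", "?", sql)   # standalone numeric literals
    return re.sub(r"\s+", " ", sql).strip().lower()

def repeated_queries(sql_statements, min_repeats=10):
    """Return (query_shape, count) for shapes executed at least min_repeats times."""
    counts = Counter(normalize_sql(s) for s in sql_statements)
    return [(shape, n) for shape, n in counts.most_common() if n >= min_repeats]

# Hypothetical statements captured for one request.
statements = [f"SELECT * FROM Price WHERE ProductId = '{i}'" for i in range(12)]
print(repeated_queries(statements))
```

One query shape repeated dozens of times inside a single request is a hint to look at the calling code, usually a loop that loads related data item by item.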

3. Errors / exceptions — the easy ones

Counterintuitively, a hard error is the best case. You can see the request, you can see the stack trace — the cause is usually right there. The only work left is assembly: request + exception + recent change + the layer the stack points to. Send those four things as text, and these often resolve same-day.

4. Cold-start / warm-up and caching effects

Services that idle on non-prod can take time to warm up; first requests look terrible, subsequent ones are instant. Before declaring a performance bug, check whether you’re measuring a cold cache or a sleeping resource. This is normal for non-prod and a configuration choice for prod — not a platform defect.

5. Configuration and dependency-cycle issues

Misconfigured modules, conflicting settings, or dependency cycles introduced by customization show up as instability that’s hard to pin on any single request. When a problem is diffuse rather than tied to one call, suspect configuration and module wiring before code logic.

Recommended reading

Bookmark these. Read them before the incident, not during it — a team fluent in this material diagnoses .NET performance issues far faster, and most of these tutorials map almost one-to-one onto the problems above.

Start with the Virto Commerce material:

Then the Microsoft .NET documentation. This is the canonical, vendor-grade reference for everything we discussed — and it’s free:

Please feel free to share your references and thoughts.