
The Core Dilemma: From Sacred Artifacts to Disposable Tools
In my practice, the most common source of infrastructure paralysis I encounter stems from a flawed mental model: the belief that all infrastructure must be permanent. I've walked into countless organizations where the database server from 2018 is treated like a family heirloom, its configuration a closely guarded secret passed down through generations of engineers. This 'Main Story' mindset, where every component is a persistent, critical character in your operational narrative, creates immense friction. Workflows become bogged down with change advisory boards, deployment freezes, and a pervasive fear of breaking the 'golden environment.' Conversely, I've also seen teams swing too far the other way, attempting to make everything ephemeral—a 'Side Quest' approach—only to find their data integrity compromised or their costs spiraling. The real expertise, I've learned, lies not in choosing one over the other, but in mastering the conceptual workflow for deciding which is which. This decision matrix impacts your team's velocity, your system's resilience, and ultimately, your business's ability to innovate. We must move beyond the binary and understand the process of classification itself.
A Tale of Two Databases: A 2024 Client Story
Last year, I worked with a fintech startup, 'AlphaPay,' that perfectly illustrated this tension. Their core transaction ledger ran on a monolithic PostgreSQL instance managed by a single senior DBA. The deployment process was a 40-step manual checklist—a classic 'Main Story' behemoth. Yet, their development and testing environments were supposed to be spun up from Terraform, but were always broken because the Terraform was six months out of date. The two halves of the workflow contradicted each other. My first recommendation wasn't a tool, but a process audit. We mapped every infrastructure component against two axes: Rate of Change and Business Criticality. The transaction database, with low change frequency but extreme criticality, was rightly a 'Main Story' component, but its management workflow was archaic. We didn't make it ephemeral; we made its persistent state immutable and auditable via rigorous Infrastructure-as-Code (IaC). The testing databases, however, were recast as pure 'Side Quests'—fully automated, disposable containers seeded with synthetic data. This conceptual separation, implemented over a 3-month period, reduced their environment provisioning time from 3 days to 20 minutes.
The key insight from AlphaPay and similar engagements is that the classification dictates the workflow. A 'Main Story' component demands workflows focused on immutability, rigorous change control, and comprehensive backup/DR strategies. A 'Side Quest' component demands workflows focused on automation, idempotency, and cost-aware lifecycle policies. Confusing the two creates process debt. For instance, applying a heavyweight CAB process to a temporary analytics cluster kills agility. Conversely, treating your customer identity store as disposable is a recipe for disaster. In the next sections, I'll break down the specific characteristics and management processes for each state, providing you with the framework to conduct your own audit.
Defining the States: Characteristics of Ephemeral and Persistent Workflows
To manage these states effectively, you must first define them clearly within your own operational context. From my experience, the definitions aren't about runtime duration alone; they're about the intended lifecycle and the management processes that support it. An ephemeral ('Side Quest') resource is defined by a workflow where creation and destruction are normal, frequent, and automated operations. Its existence is tied to a specific, temporary need—a CI/CD job, a developer's feature branch environment, a one-off data processing task. A persistent ('Main Story') resource is defined by a workflow where preservation, stability, and controlled evolution are the primary goals. Its existence is tied to a long-term business function. The critical mistake I see is assuming persistence equals 'manual' and ephemeral equals 'automated.' Both can and should be managed through code, but the nature of that code and the surrounding processes are fundamentally different.
The Ephemeral ('Side Quest') Mindset: Process as the Product
For ephemeral infrastructure, the most valuable artifact isn't the resource itself, but the flawless workflow that creates it. I instruct my clients to build processes where a developer can, with a single click or command, spawn a perfect replica of production topology for their branch. The resource's lifespan might be 2 hours or 2 days, but the workflow must be bulletproof. In a 2023 project for a gaming platform client, we implemented this using a combination of Terraform for base cloud resources and Ansible for application configuration, all triggered via a GitLab CI pipeline. The key metric we tracked was 'time-to-functional-environment,' which we drove down from 8 hours to under 12 minutes. The cost discipline is also part of the workflow: we used automated tags and Cloud Custodian rules to ensure any environment older than 48 hours sent alerts and was automatically torn down after 72 hours. This required a cultural shift to view these environments as utterly disposable, which was the hardest part.
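The alert-then-destroy lifecycle we enforced with Cloud Custodian can be sketched as a small decision function. This is an illustrative sketch, not the client's actual policy: the 48/72-hour thresholds and the idea of deriving age from a creation timestamp tag are assumptions made to keep the example self-contained.

```python
from datetime import datetime, timedelta, timezone

ALERT_AFTER = timedelta(hours=48)    # send a warning to the owner
DESTROY_AFTER = timedelta(hours=72)  # tear the environment down

def lifecycle_action(created_at: datetime, now: datetime) -> str:
    """Decide what to do with an ephemeral environment based on its age."""
    age = now - created_at
    if age >= DESTROY_AFTER:
        return "destroy"
    if age >= ALERT_AFTER:
        return "alert"
    return "keep"

# Example: an environment created 60 hours ago has passed the alert
# threshold but not yet the destroy threshold.
now = datetime(2024, 1, 4, 12, 0, tzinfo=timezone.utc)
created = now - timedelta(hours=60)
print(lifecycle_action(created, now))  # alert
```

In practice a rule like this runs on a schedule against tagged resources; the point is that destruction is a first-class, automated outcome of the workflow, not an afterthought.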
The Persistent ('Main Story') Mindset: Evolution as the Goal
Persistent infrastructure management is a workflow of careful evolution, not recreation. The goal is to apply the principle of immutability: changes are made by replacing defined components in a controlled way, rather than editing in place. The workflow for a 'Main Story' component like a core Kubernetes cluster or customer database involves staged rollouts, canary deployments, and blue-green switches. For example, with a client's primary API gateway cluster, we never SSH'd into a live instance. Instead, our workflow baked new AMIs with Packer, deployed them to an auto-scaling group behind a load balancer, and terminated the old instances. The persistent state—the configuration defining that cluster—lived in version-controlled Terraform modules. The workflow's success was measured by uptime and mean time to recovery (MTTR), not spin-up time. According to research from the DevOps Research and Assessment (DORA) team, elite performers leverage these patterns to achieve both stability and speed, debunking the myth that robustness requires slowness.
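The bake-deploy-terminate loop described above can be modeled as a pure replacement over a fleet. This is a deliberately simplified sketch of the immutability principle, not the client's Packer/auto-scaling pipeline; the `Instance` type and AMI version strings are placeholders.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    ami: str  # the image the instance was launched from

def roll_fleet(fleet: list[Instance], new_ami: str) -> list[Instance]:
    """Immutable rollout: launch replacements from the new image, then
    drop (terminate) the old instances. Nothing is edited in place."""
    replacements = [Instance(ami=new_ami) for _ in fleet]
    return replacements

fleet = [Instance("ami-v1"), Instance("ami-v1")]
fleet = roll_fleet(fleet, "ami-v2")
print([i.ami for i in fleet])  # ['ami-v2', 'ami-v2']
```

The design choice matters more than the code: because no instance is ever mutated, the version-controlled definition is always the single source of truth, and rollback is just a roll to the previous image.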
Understanding these characteristic workflows is the foundation. The next step, which trips up many teams, is applying the right one to the right component. You cannot decide based on gut feeling; you need a structured decision framework, which I've developed and refined through trial and error across different industries.
The Decision Framework: A Step-by-Step Process for Classification
Based on my repeated engagements, I've formalized a four-step workflow to classify infrastructure components. This isn't a one-time exercise; it's a living process that should be revisited with every significant architectural change. I've found that facilitating this classification as a collaborative workshop with developers, ops engineers, and product managers yields the best results, as it surfaces hidden assumptions about what is truly 'core.' The framework evaluates each component against four key dimensions: Data Criticality, Change Velocity, External Dependency, and Recreation Cost. Let me walk you through how I apply it.
Step 1: Interrogate Data Criticality and Sovereignty
The first and most important question I ask is: "What unique, difficult-to-recreate data does this component hold or manage?" If the answer is 'customer transactions,' 'user identity mappings,' or 'core product catalog,' you are almost certainly looking at a 'Main Story' component. The workflow for such components must prioritize data durability, backup/restore testing, and point-in-time recovery capabilities. For a SaaS company I advised, their user file storage was initially treated as part of ephemeral app instances. After a catastrophic failure led to data loss, we reclassified it as persistent, moving it to a managed object storage service with versioning and cross-region replication. The workflow changed from 'include in instance template' to 'mount via defined IAM roles and lifecycle policies.' This step alone prevents the most severe business risks.
Step 2: Analyze the Rate of Change and Coupling
Next, examine how often the component's configuration or software needs to change, and how tightly it's coupled to other systems. High-change-rate components that are loosely coupled are prime candidates for ephemeral workflows. A classic example is a microservice for A/B testing or a batch data transformation job. Their logic changes frequently, and they can be deployed and scaled independently. The workflow here should be fully automated from commit to deployment. Conversely, a low-change-rate but highly coupled component, like a foundational authentication service, may be persistent. However, its persistence should be managed through immutable patterns. I use a simple 2x2 matrix (Change Rate vs. Coupling) with my clients to visualize this, which often reveals misaligned investments—like heavy manual processes around a rarely-changing, isolated component.
Steps 3 and 4 involve assessing external compliance or contractual dependencies (which often force persistence) and calculating the true cost—in time and money—of recreating the component from scratch. A component with high recreation cost but low data criticality might be a candidate for a hybrid approach: persistent core with ephemeral testing clones. By scoring each component across these dimensions, you create a data-driven map for your infrastructure portfolio. This map then informs which of the three primary management methodologies you should employ.
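Scoring across the four dimensions can be made concrete with a weighted sum. The weights and the decision threshold here are illustrative placeholders — in a real engagement they come out of the workshop debate, not from me.

```python
# Score each component on the four dimensions (1 = low, 5 = high).
# A higher total pushes toward a persistent ('Main Story') workflow;
# frequent change pushes the other way, hence its negative weight.
WEIGHTS = {
    "data_criticality": 3,     # dominates the decision
    "external_dependency": 2,  # compliance/contractual pressure
    "recreation_cost": 2,      # time and money to rebuild from scratch
    "change_velocity": -1,     # frequent change favors ephemeral
}

def classify(scores: dict[str, int]) -> str:
    total = sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
    return "main-story" if total >= 15 else "side-quest"

ledger = {"data_criticality": 5, "external_dependency": 4,
          "recreation_cost": 4, "change_velocity": 1}
test_env = {"data_criticality": 1, "external_dependency": 1,
            "recreation_cost": 2, "change_velocity": 5}
print(classify(ledger), classify(test_env))  # main-story side-quest
```

The value of writing it down this way is that the weights become an explicit, arguable artifact rather than an unspoken assumption in someone's head.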
Comparing Methodologies: Three Approaches to Orchestrating Your Workflow
Once you've classified your components, you need to select the operational methodology that enforces the desired workflow. In my practice, I've seen three dominant patterns emerge, each with distinct pros, cons, and ideal use cases. It's crucial to understand that these are not mutually exclusive; most mature organizations I work with use a blend, but with clear boundaries.
Methodology A: The Pure IaC Pipeline (Declarative & Ephemeral-First)
This approach treats all infrastructure as code defined in declarative languages like Terraform HCL or Pulumi. The workflow is a GitOps pipeline: a commit to the main branch triggers a plan/apply cycle. It's excellent for enforcing consistency and is inherently geared towards ephemeral patterns. I recommend this for greenfield projects or for teams managing predominantly cloud-native, stateless services. A client in the e-commerce space used this to manage over 200 microservices, achieving incredible deployment frequency. However, the limitation I've observed is with stateful 'Main Story' components. Terraform state files for critical databases become a single point of failure and a source of team friction during complex migrations. This method works best when your 'Main Story' footprint is small and managed via the cloud provider's native managed services (e.g., AWS RDS, Azure SQL), which abstract away the server-level persistence.
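The GitOps gate at the heart of this methodology reduces to a small branching rule. This sketch mirrors a typical GitHub/GitLab setup rather than any specific product's pipeline syntax; the branch and event names are assumptions.

```python
def pipeline_stages(branch: str, event: str) -> list[str]:
    """GitOps gate: pull requests get a read-only plan for review;
    only a push to main may apply. Everything else is a no-op."""
    if event == "pull_request":
        return ["terraform plan"]
    if event == "push" and branch == "main":
        return ["terraform plan", "terraform apply -auto-approve"]
    return []

print(pipeline_stages("main", "push"))
# ['terraform plan', 'terraform apply -auto-approve']
```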
Methodology B: The Immutable Platform (Platform-as-a-Service Mindset)
Here, you build or buy a central internal platform (like a Kubernetes platform with Backstage or a cloud platform team's curated service catalog). The workflow for developers is a 'golden path'—they request a database or a queue, and the platform provisions it according to governed templates. This methodology excels at managing persistent 'Main Story' components with guardrails. The platform team ensures backups, security patches, and compliance for these core services. I helped a large financial services client implement this, reducing their 'Main Story' provisioning time from weeks to hours while improving audit compliance. The downside is potential bottlenecking and reduced flexibility for teams needing custom 'Side Quest' infrastructure. It's ideal for large organizations with stringent compliance needs and a clear separation between platform and product teams.
Methodology C: The Hybrid, Context-Aware Orchestrator
This is the most advanced pattern I implement with clients who have complex, mixed estates. It uses a meta-orchestrator (like Terraform Cloud, Spacelift, or custom scripts using cloud provider SDKs) that chooses a sub-workflow based on the component's classification. For example, a commit to a 'database-config' repo might trigger a workflow with manual approval, detailed impact analysis, and a precise backup/restore step before apply. A commit to a 'feature-environment' repo triggers a fully automated, cost-capped, destroy-after-48-hours pipeline. This requires upfront investment in classification tagging and pipeline logic, but it offers the greatest flexibility. A media company I worked with used this to manage their legacy video rendering farm (persistent) alongside their dynamic ad-targeting microservices (ephemeral) on the same cloud account without conflict.
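The classification-to-workflow dispatch at the core of this pattern can be sketched as a lookup keyed by the component's tag. The pipeline settings below (approval mode, pre-apply backup, TTL) are illustrative; the important design choice is that anything unclassified falls through to the strictest pipeline.

```python
# Sub-workflow settings per classification; the values are illustrative.
PIPELINES = {
    "main-story": {"approval": "manual", "backup_before_apply": True,
                   "ttl_hours": None},
    "side-quest": {"approval": "auto", "backup_before_apply": False,
                   "ttl_hours": 48},
}

def select_pipeline(repo_tags: dict[str, str]) -> dict:
    """Pick the sub-workflow from the component's classification tag.
    Unclassified repos default to the strictest ('main-story') pipeline."""
    classification = repo_tags.get("lifespan-intent", "main-story")
    return PIPELINES.get(classification, PIPELINES["main-story"])

print(select_pipeline({"lifespan-intent": "side-quest"})["ttl_hours"])  # 48
```

Defaulting to the strict path is deliberate: a misclassified or untagged component should cost you some speed, never some safety.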
| Methodology | Best For | Primary Workflow | Key Limitation |
|---|---|---|---|
| Pure IaC Pipeline | Ephemeral-heavy, stateless, cloud-native apps; small teams. | Git commit → automated plan/apply. | Struggles with complex stateful resources; state file management. |
| Immutable Platform | Large orgs with strict compliance; centralizing 'Main Story' components. | Developer request → platform team's templated, governed provisioning. | Can become an innovation bottleneck; platform team overhead. |
| Hybrid Orchestrator | Mixed estates with clear classification; mature DevOps culture. | Context-aware pipelines applying different rigor based on component type. | High initial complexity and maintenance cost for pipeline logic. |
Choosing the right methodology, or blend, depends entirely on the output of your classification exercise. Trying to force a pure IaC pipeline onto a legacy monolith with a critical database is a path to frustration, just as using a heavyweight platform process for a data science team's experimental clusters will kill productivity.
Real-World Application: Case Studies from My Consulting Practice
Let's move from theory to the concrete. I'll share two detailed case studies where applying this framework of classification and methodology selection led to transformative outcomes. These are not sanitized success stories; they include the challenges we faced and the specific metrics we tracked.
Case Study 1: The Gaming Studio's Unplayable Test Environments
In early 2025, I was engaged by 'Nexus Interactive,' a mid-sized game studio. Their pain point was that developer productivity was plummeting because test environments for their multiplayer backend were constantly broken or out of sync. Their infrastructure was a monolith of Terraform and manual scripts, treating everything with equal permanence. Their 'staging' environment had become a 'zombie'—neither a reliable production proxy nor a disposable sandbox. We ran the classification workshop. The core game state database (player inventories, match results) was clearly 'Main Story.' The game server fleets, however, were logically 'Side Quests'—they needed to scale to zero overnight and spin up new versions dozens of times a day for testing. The problem was their workflow conflated them. Our solution was to split the Terraform codebase. The database and networking lived in a 'core' module with strict change controls. The game server instances were moved to a separate project using Pulumi for more flexible logic, integrated into their CI/CD. We implemented Spot instances for cost and used pre-baked AMIs for fast launch. The result? Their 'time-to-test' for a new build dropped from 90 minutes to 7 minutes. Environment cost fell by 65% because we aggressively terminated idle fleets. The key was accepting that the server instances were ephemeral characters in the development 'Side Quest,' while the player data was the 'Main Story.'
Case Study 2: The Enterprise's Fear of the Cloud Console
My second example is a Fortune 500 manufacturing client (2024) with a mandate to move legacy ERP extensions to the cloud. Their IT governance was so fearful of cost overruns and security breaches that they banned developers from the cloud console and required all infrastructure to be managed by a central team via tickets—a brutal 'Main Story' process for everything. This created a 6-week backlog for provisioning a simple test database. We introduced the hybrid orchestrator model. First, we classified components: the production ERP integration layer was 'Main Story,' but the dozens of developer sandboxes and CI environments were 'Side Quests.' We then built a self-service portal using Terraform Cloud and ServiceNow integration. Requests for 'Side Quest' resources were auto-approved with hard spending limits and a 30-day auto-destruct policy. Requests for 'Main Story' resources still went to the central team but were fulfilled from pre-approved, secure modules. This changed the workflow from 'submit ticket and wait' to 'get sandbox now, request production later.' Over six months, developer satisfaction scores for tooling improved by 40 points, and the central team's backlog was eliminated, allowing them to focus on securing and optimizing the true 'Main Story' assets.
These cases show that the payoff isn't just technical; it's cultural and economic. The right workflow unlocks speed where it's safe and enforces rigor where it's necessary.
Common Pitfalls and How to Avoid Them: Lessons from the Trenches
Even with a good framework, teams make predictable mistakes. Based on my experience, here are the most common pitfalls I see when clients try to implement these concepts, and my advice on how to sidestep them.
Pitfall 1: The 'Persistence by Default' Bias
This is the most insidious issue. Without conscious effort, humans tend to preserve what exists. A temporary logging cluster set up for debugging becomes a permanent fixture because no one is tasked with killing it. The workflow lacks a 'destruction' trigger. My solution is to mandate that all provisioning workflows must include a de-provisioning policy by default. In AWS, use mandatory tags like 'ttl' or 'owner' and deploy automated cleanup tools like AWS Instance Scheduler or open-source options like Cloud Custodian. I make this a non-negotiable rule in my engagements: if you can't describe its auto-destruct condition, you cannot build it. This forces the 'Side Quest' mindset from the start.
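The "no auto-destruct condition, no build" rule can be enforced at provisioning time with a simple validation gate. This is a hedged sketch of the idea, not a real admission controller; the required tag names and the `<hours>h` TTL format are assumptions for illustration.

```python
REQUIRED_TAGS = {"owner", "ttl"}

def validate_request(tags: dict[str, str]) -> list[str]:
    """Return the policy violations for a provisioning request.
    An empty list means the request may proceed."""
    errors = [f"missing required tag: {t}"
              for t in sorted(REQUIRED_TAGS - tags.keys())]
    if "ttl" in tags and not tags["ttl"].rstrip("h").isdigit():
        errors.append("ttl must look like '<hours>h', e.g. '72h'")
    return errors

print(validate_request({"owner": "team-payments"}))    # missing ttl
print(validate_request({"owner": "team-payments", "ttl": "72h"}))  # []
```

Wiring a check like this into the pipeline turns the de-provisioning policy from a cultural aspiration into a hard precondition.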
Pitfall 2: Treating IaC as a Silver Bullet for Persistence
Teams often believe that putting a 'Main Story' component like a database into Terraform solves the persistence management problem. It doesn't. IaC manages the *definition*, not the *runtime state*. A critical database needs a separate, robust workflow for backups, point-in-time recovery, and major version upgrades—processes that often fall outside IaC's scope. I advise clients to use IaC to deploy the managed database service (e.g., AWS RDS) but to complement it with dedicated backup/DR orchestration (e.g., AWS Backup) and documented runbooks for failover. The workflow for a persistent component is multi-layered.
Pitfall 3: Ignoring the Human and Cost Factors
A technically perfect ephemeral system can fail if developers don't trust it or if it costs too much. I once designed an elegant on-demand staging environment system that no one used because developers feared losing 'their' data. The fix was to implement easy data snapshotting and restoration as part of the teardown/creation workflow. On cost, I've seen automated ephemeral systems left running over weekends, generating shocking bills. The workflow must include budget alerts and enforce 'schedule to off' patterns. According to Gartner, through 2026, 40% of organizations will overshoot cloud budgets due to ungoverned, ephemeral resource sprawl. Your process must account for human behavior and financial controls.
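The 'schedule to off' pattern is trivially small in code, which is part of my argument for mandating it. The working hours below are an example policy, not a recommendation.

```python
def should_run(hour: int, weekday: int) -> bool:
    """'Schedule to off' policy: dev fleets run 07:00-19:00, Monday
    (weekday 0) through Friday (weekday 4) only."""
    return weekday < 5 and 7 <= hour < 19

print(should_run(10, 2))  # True  (Wednesday, mid-morning)
print(should_run(10, 6))  # False (Sunday)
```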
Avoiding these pitfalls requires viewing your infrastructure not as a set of resources, but as a set of interconnected workflows with human and financial feedback loops. Your processes must be designed for the real world, not an ideal diagram.
Building Your Action Plan: A Step-by-Step Guide to Getting Started
Feeling overwhelmed is natural. Here is the exact, actionable 6-step plan I give my clients at the start of our engagement to begin mastering this duality. You can start this next week.
Step 1: Conduct an Infrastructure Inventory & Tagging Audit
You cannot manage what you cannot see. Use your cloud provider's CLI or tools like Steampipe to list all resources. The immediate goal is to tag every resource with at least two key tags: 'owner' (team/application) and 'lifespan-intent' (start with values: 'persistent-core', 'persistent-support', 'ephemeral-task', 'unknown'). This first pass will be messy, but it creates a baseline. In my experience, 30% of resources are often 'unknown' or misclassified, which is the problem you're solving.
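Once the raw inventory is exported, the audit itself is a bucketing exercise. The sketch below assumes resources arrive as plain dictionaries with a `tags` map (the shape your export script would produce, not any provider's actual API response).

```python
from collections import Counter

VALID_INTENTS = {"persistent-core", "persistent-support", "ephemeral-task"}

def audit(resources: list[dict]) -> Counter:
    """Bucket resources by their 'lifespan-intent' tag; anything missing
    or unrecognized counts as 'unknown' — that bucket is your backlog."""
    counts = Counter()
    for res in resources:
        intent = res.get("tags", {}).get("lifespan-intent")
        counts[intent if intent in VALID_INTENTS else "unknown"] += 1
    return counts

inventory = [
    {"id": "db-1", "tags": {"lifespan-intent": "persistent-core"}},
    {"id": "vm-7", "tags": {"lifespan-intent": "temp"}},  # invalid value
    {"id": "vm-9", "tags": {}},                           # untagged
]
print(audit(inventory))  # persistent-core: 1, unknown: 2
```

Tracking the size of the 'unknown' bucket over time gives you a single, honest metric for how far the classification effort has progressed.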
Step 2: Run the Classification Workshop
Gather leads from 2-3 key product teams and your platform/ops team. Pick 5-10 major components from your inventory. Use the framework from Section 3 (Data, Change, Dependency, Cost) to debate and classify each one. The goal is not consensus for its own sake, but to expose differing mental models. Document the agreed classification and the reasoning. This 2-hour workshop is the most valuable step for aligning your organization's perspective.
Step 3: Map Current vs. Desired Workflow
For each classified component, whiteboard its current provisioning, change, and decommissioning process. Then, whiteboard the desired process based on its classification. The gap between these two diagrams is your action plan. For 'Side Quest' components, the desired workflow should be 100% automated from a developer's intent. For 'Main Story' components, the desired workflow should emphasize safety and audit trails.
Step 4: Implement One High-Impact, High-Visibility Change
Don't boil the ocean. Choose one component where the gap is large and the impact is visible—like developer sandbox environments. Implement the new workflow for just that component. Use this as a proof-of-concept to work out your tooling and process kinks. Measure the before-and-after metrics (provisioning time, cost, developer satisfaction). A successful, small win builds momentum and trust for the broader transformation.
Steps 5 and 6 involve socializing the win, iterating on the pattern, and gradually applying the new workflow model to more components, evolving your chosen methodology (from Section 4) as you scale. Remember, this is a journey of continuous refinement, not a one-time project.
Frequently Asked Questions (FAQ)
Q: Can a single component be both ephemeral and persistent?
A: In my view, no—the management workflow must choose one primary mode. However, you can have a persistent core with ephemeral read-replicas or caching layers. The key is that each instance has a clear lifecycle intent. Trying to manage one instance with both mindsets leads to process confusion.
Q: How do you handle data migration when reclassifying a component from persistent to ephemeral?
A: This is a critical operation. The data must be migrated to a new, properly classified persistent store first. For example, if moving configuration out of a 'temporary' VM's filesystem, you'd first export it to a managed parameter store or database. Only after the data is safely housed in a persistent component do you convert the original host to be truly ephemeral. I always plan these as mini-projects with backup rollback steps.
Q: Doesn't this create more complexity than a one-size-fits-all approach?
A: Initially, yes. You are trading simple, uniform complexity for a more nuanced, tailored complexity. However, in my experience across dozens of clients, the tailored approach reduces operational complexity in the long run because processes are fit-for-purpose. The 'one-size-fits-all' model usually means applying the most restrictive ('Main Story') process to everything, which is the most complex and costly approach of all.
Q: How do you get buy-in from leadership focused on cost?
A: I frame it in financial terms. Ephemeral patterns, when governed, reduce waste from idle resources. Persistent patterns, when well-managed, reduce risk and the massive unplanned costs of outages or data loss. I present a TCO analysis comparing the cost of the current chaotic state to the projected cost of a managed dual-state approach, which almost always shows significant savings and risk reduction.
Conclusion: Mastering the Narrative of Your Infrastructure
The journey from infrastructure chaos to clarity is a journey of narrative control. You are the author of your system's operational story. By consciously deciding which components are 'Main Story' characters—to be developed with depth, care, and a long arc—and which are 'Side Quests'—valuable, enriching, but ultimately disposable adventures—you gain authorial control over cost, complexity, and velocity. In my ten years of guiding teams through this, the single greatest outcome isn't a technical metric; it's the cultural shift. Developers gain autonomy and speed within safe boundaries. Operators gain the ability to focus on truly critical systems. The business gains agility and resilience. Start with the classification workshop. Embrace the duality. And remember, the most powerful infrastructure is not the one that never changes, but the one whose change process is masterfully managed.