The Modern Data Stack Has a Reliability Problem
Why the modern data stack creates more points of failure and what to do about it.
By Pallisade Team
The modern data stack promised us agility. Composable tools. Best-of-breed solutions. No more monolithic data warehouses.
What we got: 15 vendors, 47 potential failure points, and a Slack channel that never stops alerting.
The Complexity Explosion
A typical modern data stack in 2025:
Sources (5+)
├── PostgreSQL (production)
├── Stripe API
├── Salesforce
├── Google Analytics
└── Segment
Ingestion (2-3)
├── Fivetran
├── Airbyte
└── Custom scripts
Transformation (2)
├── dbt Cloud
└── Spark jobs
Warehouse (1)
└── Snowflake/BigQuery/Databricks
Orchestration (1-2)
├── Airflow
└── dbt Cloud scheduler
BI Layer (2-3)
├── Looker
├── Mode
└── Hex notebooks
Reverse ETL (1)
└── Hightouch/Census
That's 15+ tools that need to work together, every day, without failure.
The probability of at least one failure per day? If each of those 47 failure points is 99% reliable, you should expect a failure roughly every third day. Over a week, a failure is all but certain.
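That claim is easy to sanity-check. A back-of-envelope sketch, assuming the 47 failure points fail independently with a hypothetical 99% daily reliability each (illustrative numbers, not measured data):

```python
def failure_probability(n_points: int, per_point_reliability: float, days: int = 1) -> float:
    """Probability of at least one failure across n independent points over `days` days."""
    daily_success = per_point_reliability ** n_points
    return 1 - daily_success ** days

print(failure_probability(47, 0.99))          # ≈ 0.38: a failure roughly every third day
print(failure_probability(47, 0.99, days=7))  # ≈ 0.96: near-certain within a week
```

Even generous per-tool reliability compounds badly once you multiply it across dozens of integration points.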
Where Things Break
1. The Ingestion Layer
| Failure Mode | Frequency | Impact |
|---|---|---|
| API rate limiting | Weekly | Incomplete data |
| Schema changes upstream | Monthly | Pipeline crashes |
| Connector paused or stale | Quarterly | Silent failures |
| Sync lag exceeding SLA | Varies | Stale downstream tables |
Real example: Salesforce changed their API response format. Fivetran handled it gracefully. Your custom Python script didn't. You found out 3 days later.
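The difference between Fivetran and the custom script is a schema guard that fails loudly instead of silently. A minimal sketch of that guard (the field names are hypothetical, not Salesforce's actual schema):

```python
# Hypothetical set of fields the downstream pipeline depends on.
EXPECTED_FIELDS = {"Id", "Amount", "CloseDate", "StageName"}

def validate_record(record: dict) -> None:
    """Fail loudly, not silently, when the upstream API response shape changes."""
    missing = EXPECTED_FIELDS - record.keys()
    extra = record.keys() - EXPECTED_FIELDS
    if missing:
        # A hard stop: better a crashed sync today than bad numbers for 3 days.
        raise ValueError(f"Upstream schema drift: missing fields {sorted(missing)}")
    if extra:
        # New fields are not fatal, but they deserve a human look.
        print(f"WARNING: new upstream fields {sorted(extra)}: review before ingesting")
```

A check like this turns a 3-day silent failure into a same-day alert.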
2. The Transformation Layer
dbt is powerful. But power creates complexity:
- 300+ models with interdependencies
- Incremental models that can get out of sync
- Tests that pass but don't catch real issues
- Undocumented changes that break downstream
3. The Orchestration Layer
Your DAGs are complex:
Source → Ingest → Stage → Transform → Mart → BI → Reverse ETL
One failure cascades. But do you know which dashboards are affected when stg_orders fails?
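Answering that question is a graph traversal over lineage. A minimal sketch, using a hypothetical lineage map (each node lists its direct downstream dependents):

```python
from collections import deque

# Hypothetical lineage graph: node -> direct downstream dependents.
LINEAGE = {
    "stg_orders": ["fct_orders"],
    "fct_orders": ["revenue_dashboard", "ops_dashboard"],
    "revenue_dashboard": [],
    "ops_dashboard": [],
}

def downstream_of(node: str, lineage: dict) -> set:
    """Breadth-first walk collecting every asset affected by a failure at `node`."""
    seen, queue = set(), deque([node])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(downstream_of("stg_orders", LINEAGE))
# {'fct_orders', 'revenue_dashboard', 'ops_dashboard'}
```

Without a lineage graph like this, the blast radius of a failed staging model is guesswork.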
4. The BI Layer
"The dashboard is showing weird numbers."
Is it:
- Bad source data?
- Failed transformation?
- Stale cache?
- Wrong filter?
- User error?
Good luck debugging without lineage.
The Monitoring Fragmentation
Each tool has its own monitoring:
- Fivetran: Sync status dashboard
- dbt Cloud: Job history
- Airflow: DAG view
- Snowflake: Query history
- Looker: Usage analytics
No single place to answer: "Is my data reliable right now?"
What We Actually Need
Unified Reliability View
One dashboard. Four questions answered:
- Is data fresh? Freshness vs SLO for all critical tables
- Are pipelines healthy? Success rate, failure patterns, MTTR
- Is quality holding? NULL rates, volume trends, schema stability
- Who owns what? Pipeline ownership, on-call routing, coverage gaps
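The first question, "Is data fresh?", reduces to comparing each table's last load time against its SLO window. A sketch with hypothetical SLOs:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness SLOs, in hours, per critical table.
FRESHNESS_SLO_HOURS = {"transactions": 1, "user_events": 6, "experiments": 24}

def freshness_breaches(last_loaded: dict, now: datetime) -> list:
    """Return the tables whose most recent load is older than their SLO window."""
    return [
        table
        for table, slo_hours in FRESHNESS_SLO_HOURS.items()
        if now - last_loaded[table] > timedelta(hours=slo_hours)
    ]
```

One function, one answer, instead of five vendor dashboards.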
Automated Issue Detection
Not just "pipeline failed" but:
- Which downstream assets are affected (lineage-driven)
- What the business impact is
- Who owns the fix
- What the fix actually is
Auto-Remediation
Don't just alert — fix.
- Missing freshness test → Generate and PR it
- Schema drift → Auto-generate validation checks
- Volume anomaly → Trace upstream to the failing connector
- Pipeline timeout → Retry config + alert threshold adjustment
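The last item, retry configuration, can be as simple as exponential backoff, so a transient timeout never pages a human. A sketch (the helper name is ours for illustration, not a real library API):

```python
import time

def run_with_retries(task, max_attempts: int = 3, base_delay: float = 1.0):
    """Retry a flaky task with exponential backoff; re-raise only after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # exhausted the budget: now it's worth an alert
            time.sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
```

Most orchestrators (Airflow included) expose retries and backoff as task-level configuration; the point is to set them deliberately rather than alerting on every transient blip.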
The Coverage Score Approach
Instead of 15 dashboards, one score per asset — rated 0-100 across 7 dimensions:
Table: orders (Coverage: 72/100)
Breakdown:
├── Freshness: 85/100
├── Volume: 78/100
├── Schema: 90/100
├── Quality: 65/100
├── Lineage: 80/100
├── Ownership: 55/100
└── Documentation: 50/100
Top Issues:
- [HIGH] 3 tables with no freshness monitor
- [MED] Volume anomaly on events (-92% vs 7d avg)
- [MED] 5 tables with no assigned owner
Leadership gets a number. Engineering gets actionable fixes.
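The example score above is just the unweighted mean of its seven dimensions. A sketch of that arithmetic (a production scorer might weight dimensions by business criticality):

```python
# Dimension scores from the orders example (illustrative numbers).
DIMENSIONS = {
    "freshness": 85, "volume": 78, "schema": 90, "quality": 65,
    "lineage": 80, "ownership": 55, "documentation": 50,
}

def coverage_score(dimensions: dict) -> int:
    """Unweighted mean across dimensions, rounded to a 0-100 integer."""
    return round(sum(dimensions.values()) / len(dimensions))

print(coverage_score(DIMENSIONS))  # 72
```

A single rollup means low ownership and documentation scores drag down an otherwise healthy table, which is exactly the signal leadership needs.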
Making the Modern Stack Reliable
Step 1: Consolidate Visibility
Stop switching between 8 tabs. Get one view.
- Connect all tools in the Pipeline tab
- Let Pallisade auto-discover lineage across the stack
- See every source, model, and dashboard in one graph
Step 2: Define SLOs
Not every table needs 99.9% uptime. Define what matters:
| Table | SLO | Why |
|---|---|---|
| transactions | 99.9% | Revenue reporting |
| user_events | 99% | Product analytics |
| experiments | 95% | A/B test results |
| logs | 90% | Debugging only |
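A quick way to make those SLOs concrete is to translate each one into an allowed-downtime budget over the reporting window:

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Convert an availability SLO into an error budget in minutes over the window."""
    return (1 - slo) * window_days * 24 * 60

print(allowed_downtime_minutes(0.999))  # 43.2 minutes/month for transactions
print(allowed_downtime_minutes(0.95))   # 2160 minutes (36 hours)/month for experiments
```

Seeing 99.9% as "43 minutes a month" makes it obvious why not every table can, or should, carry that SLO.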
Step 3: Implement Layered Monitoring
Source Layer:
└── Row count checks, schema validation, sync lag
Staging Layer:
└── Freshness tests, null checks
Mart Layer:
└── Business logic tests, uniqueness, range validation
BI Layer:
└── Dashboard freshness proxy, upstream break risk
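The layers above can feed a single aggregator that reports which layer is unhealthy, rather than 15 separate alerts. A minimal sketch with stubbed, hypothetical checks (real ones would query your warehouse and orchestrator):

```python
# Stub checks; each returns True when healthy. Real implementations would
# run row-count queries, freshness tests, and uniqueness tests.
def check_row_count() -> bool: return True       # source layer
def check_freshness() -> bool: return True       # staging layer
def check_uniqueness() -> bool: return False     # mart layer (simulated failure)

LAYERS = {
    "source": [check_row_count],
    "staging": [check_freshness],
    "mart": [check_uniqueness],
}

def failing_layers(layers: dict) -> list:
    """Run every layer's checks; report the layers with at least one failure."""
    return [name for name, checks in layers.items()
            if not all(check() for check in checks)]

print(failing_layers(LAYERS))  # ['mart']
```

Layered results also localize the problem: a mart-layer failure with healthy source and staging layers points at business logic, not ingestion.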
Step 4: Automate Remediation
For common issues, have the assistant generate ready-to-apply fixes:
- Freshness test templates per table
- Pipeline retry configurations
- Schema validation configs
- Downstream impact patches
Step 5: Track and Report
Weekly coverage reports to leadership:
> "Average coverage improved from 72 to 78 this week. We resolved 12 issues, including adding freshness monitors to 5 critical tables and closing 2 volume anomalies."
The Path Forward
The modern data stack isn't going away. It's too valuable.
But we need to stop pretending that "best-of-breed" means "automatically reliable."
Reliability is a feature you have to build. And it starts with:
- Unified visibility
- Clear SLOs
- Layered monitoring
- Automated remediation
- Continuous coverage scoring
Ready to see your modern data stack's reliability score?
Connect your tools. See your coverage. Fix your issues.
Want to See Pallisade on Your Stack?
Our team can walk you through how Pallisade monitors, diagnoses, and fixes data quality issues across your pipeline.