Prometheus + Grafana: Monitoring Stack That Won't Eat Your Budget

Datadog sends a bill for $5,000 per month. New Relic chimes in with $3,500. "Why does monitoring cost as much as two engineers' salaries?" the CFO asks. "There's a free alternative," DevOps responds. "Prometheus + Grafana, open source, zero licensing costs." Sounds great, right?

Three months later it turns out that "free" monitoring requires a half-time dedicated SRE ($75,000/year), a $200/month VPS for storing metrics, and comes with retention problems nobody can solve. The total cost of ownership is not so free after all.

But the story doesn't end here. Properly configured Prometheus + Grafana is still 2-3x cheaper than commercial solutions in typical scenarios, even after counting engineering time. The key phrase is "properly configured".

This article is an honest comparison of Prometheus + Grafana versus Datadog/New Relic. It covers real costs (not just infrastructure), when self-hosting pays off, when it is better to pay for a managed service, and how to avoid a failed migration from commercial monitoring.

Why Everyone Praises Prometheus + Grafana

Open source. Prometheus and Grafana are both free open source projects. Download, install, and it works. No licensing fees, no limits on the number of metrics or hosts.

Industry standard for Kubernetes. Prometheus has become the de facto standard for container monitoring. Kubernetes integration is native, service discovery works out of the box, and exporters exist for most common systems.

Powerful query language. PromQL allows complex aggregations and calculations. Percentiles, rate calculations, and linear predictions are all built in.
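To make the three capabilities concrete, here is a minimal sketch of what such queries look like and how they are sent to the Prometheus HTTP API (`/api/v1/query`). The server address and the metric names are assumptions; real names depend on which exporters and client libraries you run.

```python
from urllib.parse import urlencode

# Hypothetical address; replace with your own Prometheus server.
PROMETHEUS_URL = "http://localhost:9090"

# Example PromQL expressions. Metric names are assumptions for illustration.
QUERIES = {
    # Rate calculation: per-second request rate over 5 minutes, per job.
    "request_rate": "sum by (job) (rate(http_requests_total[5m]))",
    # Percentile: 95th percentile latency from histogram buckets.
    "p95_latency": (
        "histogram_quantile(0.95, "
        "sum by (le) (rate(http_request_duration_seconds_bucket[5m])))"
    ),
    # Prediction: free disk space 4 hours from now, extrapolated from 6 hours.
    "disk_in_4h": "predict_linear(node_filesystem_avail_bytes[6h], 4 * 3600)",
}


def instant_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build a URL for the Prometheus instant-query endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


if __name__ == "__main__":
    for name, expr in QUERIES.items():
        print(name, "->", instant_query_url(expr))
```

The same expressions can be pasted directly into a Grafana panel or the Prometheus web UI; the URL builder is only there to show that everything is reachable over plain HTTP.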

Visualization flexibility. Grafana connects to many data sources: Prometheus, InfluxDB, Elasticsearch, PostgreSQL, all on one dashboard. The community has created thousands of ready-made dashboards for popular services.

No vendor lock-in. Data is stored locally, in an open format. Want to migrate? Migrate anywhere. With Datadog or New Relic, migration means rewriting all dashboards and alerts.

Real Cost: Not Just Infrastructure

Datadog loves saying Prometheus is "free but expensive to operate." They're right, but let's calculate honestly.

Direct Infrastructure Expenses

VPS for Prometheus. Minimum 4GB RAM for a small deployment (10-20 targets). For 50-100 targets you need 8-16GB. Typical setup: $40-80/month on Hetzner, DigitalOcean, or Vultr.

Data storage. Prometheus stores metrics locally. Retention of 15-30 days usually fits in 50-100GB of SSD. For long-term storage you need Thanos or Cortex, which adds S3 costs (~$5-10/month for 100GB).
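As a rough cross-check on the 50-100GB figure, the Prometheus documentation gives a sizing rule: disk space is approximately retention time in seconds, times samples ingested per second, times bytes per sample (1-2 bytes on average). A minimal sketch of that rule follows; the series count and scrape interval are assumptions chosen for illustration.

```python
def estimate_disk_bytes(active_series: int,
                        scrape_interval_s: float,
                        retention_days: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Estimate local TSDB disk usage with the rule of thumb
    retention_seconds * samples_per_second * bytes_per_sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 3600
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Assumed workload: 100,000 series scraped every 15 s, kept for 30 days.
    size = estimate_disk_bytes(100_000, 15, 30)
    print(f"{size / 1e9:.1f} GB")
```

With these assumed inputs the estimate lands at about 35 GB, which leaves headroom on a 50-100GB disk. Doubling the series count or halving the scrape interval doubles the result.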

Grafana server. Can run on the same VPS as Prometheus or separately. Grafana is lightweight; 1-2GB RAM is sufficient. A shared deployment saves $10-20/month.

Alertmanager and additional services. Another 1-2GB RAM. Usually fits on main VPS.

Total infrastructure: $50-100/month for small-medium deployment (up to 100 hosts/containers).

Hidden Costs: Engineering Time

Initial setup. Prometheus configuration, Grafana dashboards, alerting rules. The first time takes 20-40 hours for an experienced engineer. If learning from scratch, add 40-60 hours.

Maintenance. Prometheus updates, disk space management, troubleshooting performance issues. Expect 1-2 hours per week for a stable system, or 50-100 hours per year.

Incidents and debugging. When something breaks, you have to figure it out yourself. The community helps, but the responsibility is on you. Budget 20-30 hours per year for troubleshooting.

If a senior SRE hour costs $75-100, engineering time comes to roughly $7,500-13,000 per year, or about $625-1,080/month.

Total Cost of Ownership: Honest Comparison

Self-hosted Prometheus + Grafana: $50-100/month infrastructure + $625-1,080 engineering time = $675-1,180/month.

For comparison, Datadog for 20 hosts: ~$400/month for the basic plan, which easily grows to $1,500-2,500/month once APM and logs are added.

Savings exist, but it's not "free vs paid." Realistically, it is 2-3x cheaper with proper setup.
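The comparison above can be reproduced with a few lines of arithmetic. This sketch uses only the monthly figures quoted in this section; the choice to compare range midpoints is an assumption made here to get a single headline ratio.

```python
def monthly_tco(infrastructure: float, engineering: float) -> float:
    """Total monthly cost of ownership in dollars."""
    return infrastructure + engineering


# Figures from this section (dollars per month).
self_hosted_low = monthly_tco(50, 625)
self_hosted_high = monthly_tco(100, 1080)
datadog_low, datadog_high = 1500, 2500

# Worst case for self-hosting: cheapest Datadog vs most expensive Prometheus.
ratio_worst = datadog_low / self_hosted_high
# Best case: most expensive Datadog vs cheapest Prometheus.
ratio_best = datadog_high / self_hosted_low
# Midpoint vs midpoint, as a single headline number.
ratio_mid = ((datadog_low + datadog_high) / 2) / (
    (self_hosted_low + self_hosted_high) / 2
)

if __name__ == "__main__":
    print(f"self-hosted: ${self_hosted_low}-{self_hosted_high}/month")
    print(f"ratio: {ratio_worst:.2f}x to {ratio_best:.2f}x, midpoint {ratio_mid:.2f}x")
```

The spread runs from about 1.3x in the least favorable case to about 3.7x in the most favorable one, with the midpoint a little above 2x. Plug in your own infrastructure bill and hourly rate before trusting any single number.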

When Prometheus + Grafana Is Profitable

You run Kubernetes or containers. Prometheus was built for this: native integration, the best tooling, and exporters for nearly everything. For Kubernetes it is the de facto standard.

You have DevOps/SRE resources. Someone must configure, maintain, and troubleshoot the stack. Without such a person, an investment in Prometheus becomes a headache.

You need flexibility and customization. Specific metrics, custom exporters, integration with internal systems. Commercial solutions often limit customization.

Your budget is constrained. Startups, non-profits, small businesses. The difference between $100 and $2,000 per month is critical.

Data must stay in-house. Compliance, regulations, paranoia. Self-hosted means the data stays under your control.

You have lots of data and metrics. With commercial solutions, cost grows with volume. Prometheus has no per-metric pricing, only infrastructure costs.

Real Case: Migration from Datadog

Company with 80 EC2 instances paid Datadog $4,200/month. Migrated to Prometheus + Grafana:

Infrastructure: $120/month (dedicated VPS 16GB + S3 storage)
Engineering time: ~60 hours initial setup + 6 hours per month maintenance
Savings: ~$3,500/month = $42,000/year

Investment paid back in two months.
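The payback claim can be checked with the numbers from this case. The hourly rate is not part of the case itself; the sketch assumes the $75-100 senior SRE rate used earlier in the article.

```python
def payback(old_monthly: float,
            new_infra_monthly: float,
            setup_hours: float,
            maintenance_hours_monthly: float,
            hourly_rate: float) -> tuple:
    """Return (months until setup cost is recovered, net monthly savings)."""
    setup_cost = setup_hours * hourly_rate
    net_monthly_savings = (
        old_monthly - new_infra_monthly - maintenance_hours_monthly * hourly_rate
    )
    return setup_cost / net_monthly_savings, net_monthly_savings


if __name__ == "__main__":
    for rate in (75, 100):  # assumed hourly rates in dollars
        months, savings = payback(4200, 120, 60, 6, rate)
        print(f"${rate}/h: saves ${savings:.0f}/month, pays back in {months:.1f} months")
```

At either rate the net savings come out close to the quoted ~$3,500/month, and the one-time setup cost is recovered in less than two months.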

When Better to Pay for Datadog/New Relic

No DevOps expertise. If there is nobody to configure and maintain the stack, a managed service is cheaper than hiring an SRE ($150,000+/year).

You need out-of-the-box functionality. APM with code-level tracing, real user monitoring, synthetic tests. Prometheus doesn't do this natively; you need additional tools.

Speed of implementation is critical. Datadog works 15 minutes after agent installation. Prometheus requires configuration, dashboards, and alerts.

Small infrastructure. For 5-10 servers, the Datadog free tier or a minimal plan ($200-300/month) may be cheaper than the engineering time spent on Prometheus.

You need everything integrated. Datadog supports 600+ integrations that work out of the box. Prometheus requires finding or writing exporters.

Multi-cloud with different services. AWS, Azure, GCP, on-premise, all in one dashboard. With Prometheus you need to assemble the stack from several tools.

Prometheus Limitations: What to Know

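Prometheus is single-node by design. It is not a distributed database, which means scalability limits. On the modest hardware described in this article, one Prometheus instance handles roughly 100,000-200,000 time series before it starts slowing down; a larger server raises that ceiling but does not remove it.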
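The time series count is driven less by the number of hosts than by label cardinality: every distinct combination of label values on a metric is its own series. A minimal sketch of the upper bound for one metric; the label names and counts are invented for illustration.

```python
from math import prod
from typing import Mapping


def max_series(label_cardinalities: Mapping[str, int]) -> int:
    """Upper bound on series for one metric: the product of the number
    of distinct values each label can take."""
    return prod(label_cardinalities.values())


# Hypothetical HTTP request counter with four labels.
http_requests = {"instance": 50, "method": 5, "status": 8, "path": 40}

if __name__ == "__main__":
    print(max_series(http_requests))
```

In this invented example a single metric can reach 80,000 series, most of a 100,000-series budget. Dropping or bucketing the `path` label cuts it to 2,000, which is why label hygiene is usually the first scaling step before adding more servers.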

Solution: federation (multiple Prometheus instances) or Thanos/Cortex for horizontal scaling. But this adds complexity.

No long-term storage. Prometheus stores data on local disk, with retention usually set to 15-30 days. For historical data you need external storage (Thanos, Cortex, VictoriaMetrics).

The pull-based model can be a problem. Prometheus scrapes its targets. If a target sits behind NAT or a firewall, you need workarounds (Pushgateway, reverse proxies).

No native logs. Prometheus handles only metrics. For logs you need Loki (Grafana stack) or ELK. Datadog does metrics + logs + traces in one tool.

PromQL is more complex than it seems. The learning curve is steep. Complex queries require understanding how aggregations, time ranges, and selectors work.

Scaling Prometheus: From Simple to Complex

Stage 1: Single Instance (up to 100K time series)

One Prometheus server, one Grafana. All on one VPS 8-16GB RAM. Retention 15 days. Simple configuration, easy to maintain.

Suitable for: small deployments, startups, single-cluster Kubernetes.

Stage 2: Federation (100K-500K time series)

Multiple Prometheus instances, each collecting data from one part of the infrastructure. A central Prometheus instance collects aggregated data from them through federation.
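Under the hood, federation is an HTTP scrape of the `/federate` endpoint on each downstream server, with one or more `match[]` selectors choosing which series to pull. The sketch below builds such a URL so it can be tested by hand before wiring it into the central server's scrape config. The hostname and selectors are assumptions.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def federate_url(base_url: str, selectors: list) -> str:
    """Build a /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url}/federate?{query}"


# Pull node metrics plus pre-aggregated recording rules named "job:...".
SELECTORS = ['{job="node"}', '{__name__=~"job:.*"}']

# Hypothetical downstream Prometheus for one team or cluster.
url = federate_url("http://prometheus-team-a:9090", SELECTORS)

if __name__ == "__main__":
    print(url)
    # Round-trip check: the selectors survive URL encoding intact.
    print(parse_qs(urlsplit(url).query)["match[]"])
```

Federating only aggregated series (the recording-rule selector) rather than everything keeps the central instance small, which is the point of this stage.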

Complexity increases: need to manage multiple instances, configure federation rules.

Stage 3: Thanos or Cortex (500K+ time series)

Distributed long-term storage. Thanos saves data in S3/GCS and provides practically unlimited retention plus a global view across Prometheus instances.

Cortex is a multi-tenant Prometheus-as-a-service: horizontal scaling, high availability, long-term storage.

This is already a production-grade setup that requires serious operational expertise.

Alternative: Grafana Cloud Managed Prometheus

Grafana Cloud offers managed Prometheus from $29/month. Eliminates operational overhead, but adds usage-based costs.

Pricing: $6.50 per 1,000 active series (low resolution) or $16 (high resolution).

For 50,000 active series: ~$325-800/month. Cheaper than Datadog, but more expensive than the infrastructure bill for self-hosted.
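The estimate is simple per-thousand pricing, sketched here so you can plug in your own series count. The prices are the ones quoted in this section; the base fee is left out, as in the estimate above.

```python
# Prices per 1,000 active series, as quoted in this section (dollars/month).
PRICE_LOW_RES = 6.50
PRICE_HIGH_RES = 16.00


def series_cost(active_series: int, price_per_1k: float) -> float:
    """Monthly usage cost for a given number of active series."""
    return active_series / 1000 * price_per_1k


if __name__ == "__main__":
    for series in (10_000, 50_000, 200_000):
        low = series_cost(series, PRICE_LOW_RES)
        high = series_cost(series, PRICE_HIGH_RES)
        print(f"{series:>7} series: ${low:,.0f}-{high:,.0f}/month")
```

Because the cost is linear in series count, the managed option gets less attractive as cardinality grows, while a self-hosted server costs the same until it runs out of memory.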

Grafana: Visualization Without Limits

Grafana free and open source. Unlimited dashboards, unlimited users, unlimited data sources.

Grafana Cloud vs Self-Hosted

Self-hosted Grafana: free, but you manage the server, updates, and backups. For small teams this is trivial.

Grafana Cloud Free tier: 10K metrics series, 50GB logs, 3 users, 14-day retention. Sufficient for testing.

Grafana Cloud Pro: $19/month base + usage ($8 per active user, $6.50 per 1K series).

Enterprise: $299/month + usage. Enterprise plugins, 24/7 support.

Many companies self-host Grafana even if they use managed Prometheus. Grafana is lightweight and simple to manage.

Community Dashboards Save Time

Grafana dashboard library contains thousands of ready dashboards. Node Exporter, Kubernetes, PostgreSQL, Nginx — everything available.

Import dashboard by ID, connect data source — ready. Saves hours on creating dashboards from scratch.

Prometheus Exporters: Metrics for Everything

An exporter collects metrics from a system and exposes them in Prometheus format. Official and community-maintained exporters exist for:

Node Exporter: system metrics (CPU, RAM, disk, network)
Blackbox Exporter: HTTP/TCP/ICMP availability checks
MySQL/PostgreSQL Exporters: database metrics
Nginx/Apache Exporters: web server stats
Kubernetes: built-in, works natively

The community has created exporters for hundreds of systems. Don't have the one you need? Writing a simple custom exporter in Go or Python takes 2-4 hours.
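To show how little a custom exporter needs, here is a minimal sketch using only the Python standard library. It serves one gauge at `/metrics` in the Prometheus text exposition format. The metric name, port, and the choice of disk usage as the thing to measure are all assumptions; in practice most people would use the official `prometheus_client` library instead of hand-rolling the format.

```python
import shutil
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_gauge(name: str, help_text: str, samples: dict) -> str:
    """Render one gauge in the Prometheus text exposition format.
    `samples` maps a label string such as 'mount="/"' (or "") to a value."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples.items():
        selector = f"{{{labels}}}" if labels else ""
        lines.append(f"{name}{selector} {value}")
    return "\n".join(lines) + "\n"


def collect() -> str:
    """Gather current values. Replace with queries to your own system."""
    usage = shutil.disk_usage("/")
    return render_gauge(
        "demo_disk_free_bytes",  # hypothetical metric name
        "Free bytes on the root filesystem.",
        {'mount="/"': usage.free},
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = collect().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9200) -> None:
    """Block forever, serving metrics on the given (assumed) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a Prometheus scrape job at port 9200 is enough to get the gauge into a dashboard. The 2-4 hours go into deciding what to measure and handling failures, not into the protocol.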

Migration from Datadog/New Relic: Practical Plan

Preparation (week 1-2)

Inventory what you monitor. Which metrics are critical? Which dashboards are used daily? Which alerts are configured?

Create a Prometheus test environment. Don't migrate production immediately. Test on dev/staging.

Configure basic exporters. Node Exporter on all hosts, application-specific exporters where needed.

Parallel Run (week 3-4)

Run Prometheus in parallel with Datadog. Collect the same data in both systems. This allows comparison and ensures nothing is missed.

Recreate critical dashboards in Grafana. Start with the most important, add the rest later.

Configure alerting in Alertmanager. Duplicate the alerts that exist in Datadog/New Relic.

Transition (week 5-6)

Switch alerting to Prometheus, but keep Datadog alerts as a fallback.

Move the team to Grafana dashboards. Train them on how to use the dashboards and where to look for data.

Monitor both systems for a week or two. Look for gaps, missing data, and false positives/negatives in alerts.

Old System Shutdown (week 7+)

When you are confident Prometheus covers everything, disable the Datadog/New Relic agents.

Export historical data if needed (the Datadog API allows export).

Celebrate cost savings.

Alertmanager: Alerting Without Vendor Lock-In

Prometheus Alertmanager manages alerts. Routing, grouping, silencing, and inhibition are all configured through a YAML config file.
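Grouping is the feature that keeps one outage from turning into fifty pages. In Alertmanager it is controlled by the `group_by` list in the route config. The sketch below is a conceptual illustration of the idea only, not Alertmanager's actual code; the alert names and labels are invented.

```python
from collections import defaultdict


def group_alerts(alerts: list, group_by: list) -> dict:
    """Bucket alerts by the values of the chosen labels, so that each
    bucket can be sent as a single notification."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical firing alerts.
alerts = [
    {"labels": {"alertname": "HighCPU", "cluster": "prod", "instance": "web-1"}},
    {"labels": {"alertname": "HighCPU", "cluster": "prod", "instance": "web-2"}},
    {"labels": {"alertname": "DiskFull", "cluster": "prod", "instance": "db-1"}},
]

grouped = group_alerts(alerts, ["alertname", "cluster"])

if __name__ == "__main__":
    for key, members in grouped.items():
        instances = [a["labels"]["instance"] for a in members]
        print(key, "->", instances)
```

Three firing alerts collapse into two notifications here. Grouping by fewer labels means fewer, larger notifications; grouping by more labels means the opposite.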

Integrations: email, Slack, PagerDuty, Opsgenie, webhooks. The common notification channels are covered.

Advantage: alerts are defined in code (Infrastructure as Code). Version controlled, reproducible, peer-reviewed.

Disadvantage: no GUI for creating alerts. Rules are written as YAML with PromQL expressions, which means a learning curve.

Loki for Logs: Complement to Prometheus

Prometheus handles metrics only. For logs, add Loki.

Loki was created by Grafana Labs as "Prometheus for logs": similar architecture, label-based indexing, and a PromQL-like query language (LogQL).
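The family resemblance to PromQL is easiest to see in an example. This sketch builds a request for Loki's `/loki/api/v1/query_range` HTTP endpoint; the server address and the label names in the queries are assumptions.

```python
import time
from urllib.parse import urlencode

# Hypothetical address; replace with your own Loki server.
LOKI_URL = "http://localhost:3100"

# Log query: label selector (as in PromQL) plus a line filter.
ERROR_LINES = '{namespace="prod", app="api"} |= "error"'
# Metric query over logs: error lines per second, per app.
ERROR_RATE = 'sum by (app) (rate({namespace="prod"} |= "error" [5m]))'


def query_range_url(logql: str, start_ns: int, end_ns: int,
                    limit: int = 100, base_url: str = LOKI_URL) -> str:
    """Build a query_range URL. Timestamps are Unix epoch nanoseconds."""
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    now = time.time_ns()
    one_hour_ago = now - 3600 * 10**9
    print(query_range_url(ERROR_LINES, one_hour_ago, now))
    print(query_range_url(ERROR_RATE, one_hour_ago, now))
```

The selector in braces is the only indexed part. The `|= "error"` filter scans log lines at query time, which is the trade-off behind Loki's cheap storage.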

Loki Advantages:

Cheap storage. Indexes only labels, not the entire text. S3 storage costs pennies.
Grafana integration. Logs and metrics on one dashboard, so correlation is easy.
Kubernetes-friendly. Label extraction from pod metadata is automatic.

Disadvantages:

Fewer capabilities than ELK. Full-text search is limited, analytics are weaker.
Smaller community than Elasticsearch.

For most use cases Loki is sufficient. For complex log analytics, consider ELK.

AGPLv3 Licensing: Important Change

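Grafana, Loki, and Tempo switched to AGPLv3 in 2021. This is a copyleft license: if you modify the code and distribute it or offer it to users over a network, you must make your modified source available under the same license.

For most users this is not a problem: you use the software as-is and don't modify it.

It becomes a problem if you modify the code and offer it as a service. AGPLv3 then requires publishing your changes.

Prometheus remains under the Apache 2.0 license (permissive).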

Best Practices for Production

Retention and storage planning. Determine how many days of retention you need: 15-30 days in Prometheus, long-term in Thanos/S3.

High availability. Run two Prometheus instances collecting the same data. If one fails, the second keeps working. Alertmanager deduplicates the alerts.

Configuration backup. Prometheus config, Grafana dashboards, alerting rules: keep all of it in git, versioned and backed up.

Monitoring of monitoring. Use an external check (e.g., Uptime Robot) to verify Grafana is available. Otherwise, how will you know monitoring is down?
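If you would rather not depend on a third-party service, the external check can be a small script run from cron on a machine outside the monitored infrastructure. This sketch polls Grafana's `/api/health` endpoint, which returns JSON with a `database` field. The injectable `fetch` argument is a design choice made here so the logic can be tested without a network.

```python
import json
import urllib.request


def grafana_is_healthy(base_url: str, fetch=None, timeout: float = 5.0) -> bool:
    """Return True if Grafana answers /api/health with HTTP 200 and
    reports its database as "ok". Any network or parse error counts
    as unhealthy."""

    def default_fetch(url):
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read()

    fetch = fetch or default_fetch
    try:
        status, body = fetch(f"{base_url}/api/health")
        return status == 200 and json.loads(body).get("database") == "ok"
    except (OSError, ValueError, AttributeError):
        # OSError covers connection failures and HTTP errors from urllib;
        # ValueError covers malformed JSON; AttributeError covers JSON
        # that is not an object.
        return False
```

A wrapper that calls `grafana_is_healthy("https://grafana.example.com")` (a placeholder URL) and sends an email or SMS on `False` completes the loop. The same pattern works for Prometheus itself via its `/-/healthy` endpoint, which only needs the status code check.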

Resource limits. Configure CPU/RAM limits for Prometheus in Kubernetes. Unbounded memory usage can take down a node.

Regular updates. Prometheus and Grafana are under active development: security patches, new features, bug fixes. Update at least quarterly.

Key Takeaways

Prometheus + Grafana really is cheaper than commercial solutions. But "free" is a myth. Real savings are 2-3x after accounting for engineering time.

Self-hosting is justified when you have DevOps expertise, run Kubernetes or containers, and need flexibility or many metrics. For small teams without expertise, a managed service may have the lower total cost.

Operational overhead is real. Initial setup takes 20-40 hours, ongoing maintenance 50-100 hours per year. Budget the time or prepare for problems.

Prometheus has limitations. Single-node scaling limits, no native logs, a learning curve on PromQL. Know what you are choosing.

Grafana Cloud is a compromise between self-hosted and Datadog. Managed infrastructure and usage-based pricing, but cheaper than Datadog.

Migration from Datadog/New Relic requires planning. Parallel run, recreating dashboards and alerts, team training. Budget 4-8 weeks.

The Prometheus + Grafana + Loki stack provides metrics and logs for a fraction of what commercial solutions cost. But it requires investment in expertise and infrastructure.

The right choice depends on your situation. Evaluate total cost of ownership honestly: infrastructure + engineering time + opportunity cost. For many organizations self-hosting is still the more economical option. For others a managed service pays off.