

Implementing an Order Processing System: Part 4 - Monitoring and Alerting

Published on 2024-11-09

1. Introduction and Goals

Welcome to the fourth installment of our series on implementing a sophisticated order processing system! In our previous posts, we laid the foundation for our project, explored advanced Temporal workflows, and delved into advanced database operations. Today, we’re focusing on an equally crucial aspect of any production-ready system: monitoring and alerting.

Recap of Previous Posts

  1. In Part 1, we set up our project structure and implemented a basic CRUD API.
  2. In Part 2, we expanded our use of Temporal, implementing complex workflows and exploring advanced concepts.
  3. In Part 3, we focused on advanced database operations, including optimization, sharding, and ensuring consistency in distributed systems.

Importance of Monitoring and Alerting in Microservices Architecture

In a microservices architecture, especially one handling complex processes like order management, effective monitoring and alerting are crucial. They allow us to:

  1. Understand the behavior and performance of our system in real-time
  2. Quickly identify and diagnose issues before they impact users
  3. Make data-driven decisions for scaling and optimization
  4. Ensure the reliability and availability of our services

Overview of Prometheus and its Ecosystem

Prometheus is an open-source systems monitoring and alerting toolkit. It’s become a standard in the cloud-native world due to its powerful features and extensive ecosystem. Key components include:

  1. Prometheus Server : Scrapes and stores time series data
  2. Client Libraries : Allow easy instrumentation of application code
  3. Alertmanager : Handles alerts from Prometheus server
  4. Pushgateway : Allows ephemeral and batch jobs to expose metrics
  5. Exporters : Allow third-party systems to expose metrics to Prometheus

We’ll also be using Grafana, a popular open-source platform for monitoring and observability, to create dashboards and visualize our Prometheus data.

Goals for this Part of the Series

By the end of this post, you’ll be able to:

  1. Set up Prometheus to monitor our order processing system
  2. Implement custom metrics in our Go services
  3. Create informative dashboards using Grafana
  4. Set up alerting rules to notify us of potential issues
  5. Monitor database performance and Temporal workflows effectively

Let’s dive in!

2. Theoretical Background and Concepts

Before we start implementing, let’s review some key concepts that will be crucial for our monitoring and alerting setup.

Observability in Distributed Systems

Observability refers to the ability to understand the internal state of a system by examining its outputs. In distributed systems like our order processing system, observability typically encompasses three main pillars:

  1. Metrics : Numerical representations of data measured over intervals of time
  2. Logs : Detailed records of discrete events within the system
  3. Traces : Representations of causal chains of events across components

In this post, we’ll focus primarily on metrics, though we’ll touch on how these can be integrated with logs and traces.

Prometheus Architecture

Prometheus follows a pull-based architecture:

  1. Data Collection : Prometheus scrapes metrics from instrumented jobs via HTTP
  2. Data Storage : Metrics are stored in a time-series database on the local storage
  3. Querying : PromQL allows flexible querying of this data
  4. Alerting : Prometheus can trigger alerts based on query results
  5. Visualization : While Prometheus has a basic UI, it’s often paired with Grafana for richer visualizations

Metrics Types in Prometheus

Prometheus offers four core metric types (illustrated in the sketch after this list):

  1. Counter : A cumulative metric that only goes up (e.g., number of requests processed)
  2. Gauge : A metric that can go up and down (e.g., current memory usage)
  3. Histogram : Samples observations and counts them in configurable buckets (e.g., request durations)
  4. Summary : Similar to histogram, but calculates configurable quantiles over a sliding time window
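
These types map directly onto constructors in the Go client library we'll use later. As a purely illustrative sketch (none of these particular metrics are part of our system), declaring one of each with promauto might look like this:

package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // Counter: only ever increases
    jobsProcessed = promauto.NewCounter(prometheus.CounterOpts{
        Name: "jobs_processed_total",
        Help: "Total number of processed jobs",
    })

    // Gauge: can go up and down
    queueDepth = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "queue_depth",
        Help: "Current number of items waiting in the queue",
    })

    // Histogram: observations counted in configurable buckets
    jobDuration = promauto.NewHistogram(prometheus.HistogramOpts{
        Name:    "job_duration_seconds",
        Help:    "Duration of job execution in seconds",
        Buckets: prometheus.DefBuckets,
    })

    // Summary: configurable quantiles over a sliding time window
    jobPayloadBytes = promauto.NewSummary(prometheus.SummaryOpts{
        Name:       "job_payload_bytes",
        Help:       "Size of job payloads in bytes",
        Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
    })
)

In the rest of this post we'll mostly rely on counters, gauges, and histograms.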

Introduction to PromQL

PromQL (Prometheus Query Language) is a powerful functional language for querying Prometheus data. It allows you to select and aggregate time series data in real time. Key features include:

  • Instant vector selectors
  • Range vector selectors
  • Offset modifier
  • Aggregation operators
  • Binary operators

We’ll see examples of PromQL queries as we build our dashboards and alerts.

Overview of Grafana

Grafana is a multi-platform open source analytics and interactive visualization web application. It provides charts, graphs, and alerts for the web when connected to supported data sources, of which Prometheus is one. Key features include:

  • Flexible dashboard creation
  • Wide range of visualization options
  • Alerting capabilities
  • User authentication and authorization
  • Plugin system for extensibility

Now that we’ve covered these concepts, let’s start implementing our monitoring and alerting system.

3. Setting Up Prometheus for Our Order Processing System

Let’s begin by setting up Prometheus to monitor our order processing system.

Installing and Configuring Prometheus

First, let’s add Prometheus to our docker-compose.yml file:

services:
  # ... other services ...

  prometheus:
    image: prom/prometheus:v2.30.3
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    ports:
      - 9090:9090

volumes:
  # ... other volumes ...
  prometheus_data: {}

Next, create a prometheus.yml file in the ./prometheus directory:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'order_processing_api'
    static_configs:
      - targets: ['order_processing_api:8080']

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres_exporter:9187']

This configuration tells Prometheus to scrape metrics from itself, our order processing API, and a Postgres exporter (which we’ll set up later).

Implementing Prometheus Exporters for Our Go Services

To expose metrics from our Go services, we’ll use the Prometheus client library. First, add it to your go.mod:

go get github.com/prometheus/client_golang

Now, let’s modify our main Go file to expose metrics:

package main

import (
    "strconv"

    "github.com/gin-gonic/gin"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )

    httpRequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "http_request_duration_seconds",
            Help: "Duration of HTTP requests in seconds",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint"},
    )
)

func init() {
    prometheus.MustRegister(httpRequestsTotal)
    prometheus.MustRegister(httpRequestDuration)
}

func main() {
    r := gin.Default()

    // Middleware to record metrics
    r.Use(func(c *gin.Context) {
        timer := prometheus.NewTimer(httpRequestDuration.WithLabelValues(c.Request.Method, c.FullPath()))
        c.Next()
        timer.ObserveDuration()
        httpRequestsTotal.WithLabelValues(c.Request.Method, c.FullPath(), strconv.Itoa(c.Writer.Status())).Inc()
    })

    // Expose metrics endpoint
    r.GET("/metrics", gin.WrapH(promhttp.Handler()))

    // ... rest of your routes ...

    r.Run(":8080")
}

This code sets up two metrics:

  1. http_requests_total: A counter that tracks the total number of HTTP requests
  2. http_request_duration_seconds: A histogram that tracks the duration of HTTP requests

Setting Up Service Discovery for Dynamic Environments

For more dynamic environments, Prometheus supports various service discovery mechanisms. For example, if you’re running on Kubernetes, you might use the Kubernetes SD configuration:

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

This configuration will automatically discover and scrape metrics from pods with the appropriate annotations.

Configuring Retention and Storage for Prometheus Data

Prometheus stores data in a time-series database on the local filesystem. Retention time and maximum storage size are configured with command-line flags rather than in prometheus.yml, so we add them to the prometheus service's command in docker-compose.yml:

services:
  prometheus:
    # ... other settings ...
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
      - '--storage.tsdb.retention.size=50GB'

This configuration sets a retention period of 15 days and a maximum storage size of 50GB.

In the next section, we’ll dive into defining and implementing custom metrics for our order processing system.

4. Defining and Implementing Custom Metrics

Now that we have Prometheus set up and basic HTTP metrics implemented, let’s define and implement custom metrics specific to our order processing system.

Designing a Metrics Schema for Our Order Processing System

When designing metrics, it’s important to think about what insights we want to gain from our system. For our order processing system, we might want to track:

  1. Order creation rate
  2. Order processing time
  3. Order status distribution
  4. Payment processing success/failure rate
  5. Inventory update operations
  6. Shipping arrangement time

Let’s implement these metrics:

package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    OrdersCreated = promauto.NewCounter(prometheus.CounterOpts{
        Name: "orders_created_total",
        Help: "The total number of created orders",
    })

    OrderProcessingTime = promauto.NewHistogram(prometheus.HistogramOpts{
        Name: "order_processing_seconds",
        Help: "Time taken to process an order",
        Buckets: prometheus.LinearBuckets(0, 30, 10), // 0-300 seconds, 30-second buckets
    })

    OrderStatusGauge = promauto.NewGaugeVec(prometheus.GaugeOpts{
        Name: "orders_by_status",
        Help: "Number of orders by status",
    }, []string{"status"})

    PaymentProcessed = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "payments_processed_total",
        Help: "The total number of processed payments",
    }, []string{"status"})

    InventoryUpdates = promauto.NewCounter(prometheus.CounterOpts{
        Name: "inventory_updates_total",
        Help: "The total number of inventory updates",
    })

    ShippingArrangementTime = promauto.NewHistogram(prometheus.HistogramOpts{
        Name: "shipping_arrangement_seconds",
        Help: "Time taken to arrange shipping",
        Buckets: prometheus.LinearBuckets(0, 60, 5), // 0-300 seconds, 60-second buckets
    })
)

Implementing Application-Specific Metrics in Our Go Services

Now that we’ve defined our metrics, let’s implement them in our service:

package main

import (
    "time"

    "github.com/yourusername/order-processing-system/metrics"
)

func createOrder(order Order) error {
    startTime := time.Now()

    // Order creation logic...

    metrics.OrdersCreated.Inc()
    metrics.OrderProcessingTime.Observe(time.Since(startTime).Seconds())
    metrics.OrderStatusGauge.WithLabelValues("pending").Inc()

    return nil
}

func processPayment(payment Payment) error {
    // Payment processing logic...

    if paymentSuccessful {
        metrics.PaymentProcessed.WithLabelValues("success").Inc()
    } else {
        metrics.PaymentProcessed.WithLabelValues("failure").Inc()
    }

    return nil
}

func updateInventory(item Item) error {
    // Inventory update logic...

    metrics.InventoryUpdates.Inc()

    return nil
}

func arrangeShipping(order Order) error {
    startTime := time.Now()

    // Shipping arrangement logic...

    metrics.ShippingArrangementTime.Observe(time.Since(startTime).Seconds())

    return nil
}

Best Practices for Naming and Labeling Metrics

When naming and labeling metrics, consider these best practices (a short example follows the list):

  1. Use a consistent naming scheme (e.g., namespace_subsystem_name)
  2. Use clear, descriptive names
  3. Include units in the metric name (e.g., _seconds, _bytes)
  4. Use labels to differentiate instances of a metric, but be cautious of high cardinality
  5. Keep the number of labels manageable
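
For illustration only (this metric isn't part of our system), a declaration that follows these guidelines, using the Namespace and Subsystem fields for a consistent name and keeping the label set small and bounded, might look like:

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// Full metric name becomes order_processing_shipping_items_shipped_total.
var itemsShippedTotal = promauto.NewCounterVec(prometheus.CounterOpts{
    Namespace: "order_processing",
    Subsystem: "shipping",
    Name:      "items_shipped_total",
    Help:      "Total number of items shipped, by carrier",
}, []string{"carrier"}) // small, bounded label set; avoid labels like user_id or order_id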

Instrumenting Key Components: API Endpoints, Database Operations, Temporal Workflows

For API endpoints, we’ve already implemented basic instrumentation. For database operations, we can add metrics like this:

func (s *Store) GetOrder(ctx context.Context, id int64) (Order, error) {
    startTime := time.Now()
    defer func() {
        metrics.DBOperationDuration.WithLabelValues("GetOrder").Observe(time.Since(startTime).Seconds())
    }()

    // Existing GetOrder logic...
}

For Temporal workflows, we can add metrics in our activity implementations:

func ProcessOrderActivity(ctx context.Context, order Order) error {
    startTime := time.Now()
    defer func() {
        metrics.WorkflowActivityDuration.WithLabelValues("ProcessOrder").Observe(time.Since(startTime).Seconds())
    }()

    // Existing ProcessOrder logic...
}
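
Note that DBOperationDuration and WorkflowActivityDuration aren't part of the metrics package we defined earlier. A minimal sketch of adding them, with names and buckets chosen as reasonable assumptions:

package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // DBOperationDuration tracks how long individual database operations take.
    DBOperationDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "db_operation_duration_seconds",
        Help:    "Duration of database operations in seconds",
        Buckets: prometheus.DefBuckets,
    }, []string{"operation"})

    // WorkflowActivityDuration tracks how long Temporal activity executions take.
    WorkflowActivityDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "workflow_activity_duration_seconds",
        Help:    "Duration of Temporal activity executions in seconds",
        Buckets: prometheus.DefBuckets,
    }, []string{"activity"})
)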

5. Creating Dashboards with Grafana

Now that we have our metrics set up, let’s visualize them using Grafana.

Installing and Configuring Grafana

First, let’s add Grafana to our docker-compose.yml:

services:
  # ... other services ...

  grafana:
    image: grafana/grafana:8.2.2
    ports:
      - 3000:3000
    volumes:
      - grafana_data:/var/lib/grafana

volumes:
  # ... other volumes ...
  grafana_data: {}

Connecting Grafana to Our Prometheus Data Source

  1. Access Grafana at http://localhost:3000 (default credentials are admin/admin)
  2. Go to Configuration > Data Sources
  3. Click “Add data source” and select Prometheus
  4. Set the URL to http://prometheus:9090 (this is the Docker service name)
  5. Click “Save & Test”

Designing Effective Dashboards for Our Order Processing System

Let’s create a dashboard for our order processing system:

  1. Click “Create” > “Dashboard”
  2. Add a new panel

For our first panel, let’s create a graph of order creation rate:

  1. In the query editor, enter: rate(orders_created_total[5m])
  2. Set the panel title to “Order Creation Rate”
  3. Under Settings, set the unit to “orders/second”

Let’s add another panel for order processing time:

  1. Add a new panel
  2. Query: histogram_quantile(0.95, rate(order_processing_seconds_bucket[5m]))
  3. Title: “95th Percentile Order Processing Time”
  4. Unit: “seconds”

For order status distribution:

  1. Add a new panel
  2. Query: orders_by_status
  3. Visualization: Pie Chart
  4. Title: “Order Status Distribution”

Continue adding panels for other metrics we’ve defined.

Implementing Variable Templating for Flexible Dashboards

Grafana allows us to create variables that can be used across the dashboard. Let’s create a variable for time range:

  1. Go to Dashboard Settings > Variables
  2. Click “Add variable”
  3. Name: time_range
  4. Type: Interval
  5. Values: 5m,15m,30m,1h,6h,12h,24h,7d

Now we can use this in our queries like this: rate(orders_created_total[$time_range])

Best Practices for Dashboard Design and Organization

  1. Group related panels together
  2. Use consistent color schemes
  3. Include a description for each panel
  4. Use appropriate visualizations for each metric type
  5. Consider creating separate dashboards for different aspects of the system (e.g., Orders, Inventory, Shipping)

In the next section, we’ll set up alerting rules to notify us of potential issues in our system.

6. Implementing Alerting Rules

Now that we have our metrics and dashboards set up, let’s implement alerting to proactively notify us of potential issues in our system.

Designing an Alerting Strategy for Our System

When designing alerts, consider the following principles:

  1. Alert on symptoms, not causes
  2. Ensure alerts are actionable
  3. Avoid alert fatigue by only alerting on critical issues
  4. Use different severity levels for different types of issues

For our order processing system, we might want to alert on:

  1. High error rate in order processing
  2. Slow order processing time
  3. Unusual spike or drop in order creation rate
  4. Low inventory levels
  5. High rate of payment failures

Implementing Prometheus Alerting Rules

Let’s create an alerts.yml file in our Prometheus configuration directory:

groups:
- name: order_processing_alerts
  rules:
  - alert: HighOrderProcessingErrorRate
    expr: rate(order_processing_errors_total[5m]) / rate(orders_created_total[5m]) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: High order processing error rate
      description: "Error rate is over the last 5 minutes"

  - alert: SlowOrderProcessing
    expr: histogram_quantile(0.95, rate(order_processing_seconds_bucket[5m])) > 300
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Slow order processing
      description: "95th percentile of order processing time is over the last 5 minutes"

  - alert: UnusualOrderRate
    expr: abs(rate(orders_created_total[1h]) - rate(orders_created_total[1h] offset 1d)) > (rate(orders_created_total[1h] offset 1d) * 0.3)
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: Unusual order creation rate
      description: "Order creation rate has changed by more than 30% compared to the same time yesterday"

  - alert: LowInventory
    expr: inventory_level < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Low inventory level
      description: "Inventory level for {{ $labels.item }} is {{ $value }}"

  - alert: HighPaymentFailureRate
    expr: sum(rate(payments_processed_total{status="failure"}[15m])) / sum(rate(payments_processed_total[15m])) > 0.1
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: High payment failure rate
      description: "Payment failure rate is {{ $value }} over the last 15 minutes"

Update your prometheus.yml to include this alerts file:

rule_files:
  - "alerts.yml"

Setting Up Alertmanager for Alert Routing and Grouping

Now, let’s set up Alertmanager to handle our alerts. Add Alertmanager to your docker-compose.yml:

services:
  # ... other services ...

  alertmanager:
    image: prom/alertmanager:v0.23.0
    ports:
      - 9093:9093
    volumes:
      - ./alertmanager:/etc/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'

Create an alertmanager.yml in the ./alertmanager directory:

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email-notifications'

receivers:
- name: 'email-notifications'
  email_configs:
  - to: '[email protected]'
    from: '[email protected]'
    smarthost: 'smtp.example.com:587'
    auth_username: '[email protected]'
    auth_identity: '[email protected]'
    auth_password: 'password'

Update your prometheus.yml to point to Alertmanager:

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

Configuring Notification Channels

In the Alertmanager configuration above, we’ve set up email notifications. You can also configure other channels like Slack, PagerDuty, or custom webhooks.

Implementing Alert Severity Levels and Escalation Policies

In our alerts, we’ve used severity labels. We can use these in Alertmanager to implement different routing or notification strategies based on severity:

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email-notifications'
  routes:
  - match:
      severity: critical
    receiver: 'pagerduty-critical'
  - match:
      severity: warning
    receiver: 'slack-warnings'

receivers:
- name: 'email-notifications'
  email_configs:
  - to: '[email protected]'
- name: 'pagerduty-critical'
  pagerduty_configs:
  - service_key: ''
- name: 'slack-warnings'
  slack_configs:
  - api_url: ''
    channel: '#alerts'

7. Monitoring Database Performance

Monitoring database performance is crucial for maintaining a responsive and reliable system. Let’s set up monitoring for our PostgreSQL database.

Implementing the Postgres Exporter for Prometheus

First, add the Postgres exporter to your docker-compose.yml:

services:
  # ... other services ...

  postgres_exporter:
    image: wrouesnel/postgres_exporter:latest
    environment:
      DATA_SOURCE_NAME: "postgresql://user:password@postgres:5432/dbname?sslmode=disable"
    ports:
      - 9187:9187

Make sure to replace user, password, and dbname with your actual PostgreSQL credentials.

Key Metrics to Monitor for Postgres Performance

Some important PostgreSQL metrics to monitor include:

  1. Number of active connections
  2. Database size
  3. Query execution time
  4. Cache hit ratio
  5. Replication lag (if using replication)
  6. Transaction rate
  7. Tuple operations (inserts, updates, deletes)

Creating a Database Performance Dashboard in Grafana

Let’s create a new dashboard for database performance:

  1. Create a new dashboard in Grafana
  2. Add a panel for active connections:
    • Query: pg_stat_activity_count{datname="your_database_name"}
    • Title: “Active Connections”
  3. Add a panel for database size:
    • Query: pg_database_size_bytes{datname="your_database_name"}
    • Title: “Database Size”
    • Unit: bytes (IEC)
  4. Add a panel for transaction rate:
    • Query: rate(pg_stat_database_xact_commit{datname="your_database_name"}[5m]) + rate(pg_stat_database_xact_rollback{datname="your_database_name"}[5m])
    • Title: “Transactions per Second”
  5. Add a panel for cache hit ratio:
    • Query: pg_stat_database_blks_hit{datname="your_database_name"} / (pg_stat_database_blks_hit{datname="your_database_name"} + pg_stat_database_blks_read{datname="your_database_name"})
    • Title: “Cache Hit Ratio”

Setting Up Alerts for Database Issues

Let’s add some database-specific alerts to our alerts.yml:

  - alert: HighDatabaseConnections
    expr: pg_stat_activity_count > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: High number of database connections
      description: "There are active database connections"

  - alert: LowCacheHitRatio
    expr: pg_stat_database_blks_hit / (pg_stat_database_blks_hit   pg_stat_database_blks_read) 



8. Monitoring Temporal Workflows

Monitoring Temporal workflows is essential for ensuring the reliability and performance of our order processing system.

Implementing Temporal Metrics in Our Go Services

The Temporal Go SDK reports metrics through a MetricsHandler configured on the client; workers created from that client pick it up automatically. One common way to expose these metrics to Prometheus is the Tally-based handler shipped in go.temporal.io/sdk/contrib/tally. The sketch below uses that approach:

import (
    "log"
    "time"

    prom "github.com/prometheus/client_golang/prometheus"
    "github.com/uber-go/tally/v4"
    "github.com/uber-go/tally/v4/prometheus"
    "go.temporal.io/sdk/client"
    sdktally "go.temporal.io/sdk/contrib/tally"
    "go.temporal.io/sdk/worker"
)

// newPrometheusScope creates a Tally scope backed by a Prometheus reporter
// that serves metrics over HTTP on the configured listen address.
func newPrometheusScope(c prometheus.Configuration) tally.Scope {
    reporter, err := c.NewReporter(prometheus.ConfigurationOptions{
        Registry: prom.NewRegistry(),
        OnError: func(err error) {
            log.Println("error in prometheus reporter", err)
        },
    })
    if err != nil {
        log.Fatalln("Unable to create prometheus reporter", err)
    }
    scope, _ := tally.NewRootScope(tally.ScopeOptions{
        CachedReporter: reporter,
        Separator:      prometheus.DefaultSeparator,
    }, time.Second)
    return scope
}

func main() {
    // ... other setup ...

    // Create Temporal client with a Prometheus-backed metrics handler
    c, err := client.NewClient(client.Options{
        MetricsHandler: sdktally.NewMetricsHandler(newPrometheusScope(prometheus.Configuration{
            ListenAddress: "0.0.0.0:9464",
            TimerType:     "histogram",
        })),
    })
    if err != nil {
        log.Fatalln("Unable to create Temporal client", err)
    }
    defer c.Close()

    // Workers created from this client report metrics automatically
    w := worker.New(c, "order-processing-task-queue", worker.Options{})

    // ... register workflows and activities ...

    // Run the worker
    err = w.Run(worker.InterruptCh())
    if err != nil {
        log.Fatalln("Unable to start worker", err)
    }
}

Remember to add the worker's metrics address (port 9464 in this sketch) as a scrape target in prometheus.yml.

Key Metrics to Monitor for Temporal Workflows

Important Temporal metrics to monitor include:

  1. Workflow start rate
  2. Workflow completion rate
  3. Workflow execution time
  4. Activity success/failure rate
  5. Activity execution time
  6. Task queue latency

Creating a Temporal Workflow Dashboard in Grafana

Let’s create a dashboard for Temporal workflows:

  1. Create a new dashboard in Grafana
  2. Add a panel for workflow start rate:
    • Query: rate(temporal_workflow_start_total[5m])
    • Title: “Workflow Start Rate”
  3. Add a panel for workflow completion rate:
    • Query: rate(temporal_workflow_completed_total[5m])
    • Title: “Workflow Completion Rate”
  4. Add a panel for workflow execution time:
    • Query: histogram_quantile(0.95, rate(temporal_workflow_execution_time_bucket[5m]))
    • Title: “95th Percentile Workflow Execution Time”
    • Unit: seconds
  5. Add a panel for activity success rate:
    • Query: rate(temporal_activity_success_total[5m]) / (rate(temporal_activity_success_total[5m]) + rate(temporal_activity_fail_total[5m]))
    • Title: “Activity Success Rate”

Setting Up Alerts for Workflow Issues

Let’s add some Temporal-specific alerts to our alerts.yml:

  - alert: HighWorkflowFailureRate
    expr: rate(temporal_workflow_failed_total[15m]) / rate(temporal_workflow_completed_total[15m]) > 0.05
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: High workflow failure rate
      description: "Workflow failure rate is over the last 15 minutes"

  - alert: LongRunningWorkflow
    expr: histogram_quantile(0.95, rate(temporal_workflow_execution_time_bucket[1h])) > 3600
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: Long-running workflows detected
      description: "95th percentile of workflow execution time is over 1 hour"

These alerts will help you detect issues with your Temporal workflows, such as high failure rates or unexpectedly long-running workflows.

In the next sections, we’ll cover some advanced Prometheus techniques and discuss testing and validation of our monitoring setup.

9. Advanced Prometheus Techniques

As our monitoring system grows more complex, we can leverage some advanced Prometheus techniques to improve its efficiency and capabilities.

Using Recording Rules for Complex Queries and Aggregations

Recording rules allow you to precompute frequently needed or computationally expensive expressions and save their result as a new set of time series. This can significantly speed up the evaluation of dashboards and alerts.

Let’s add some recording rules to our Prometheus configuration. Create a rules.yml file:

groups:
- name: example_recording_rules
  interval: 5m
  rules:
  - record: job:order_processing_rate:5m
    expr: rate(orders_created_total[5m])

  - record: job:order_processing_error_rate:5m
    expr: rate(order_processing_errors_total[5m]) / rate(orders_created_total[5m])

  - record: job:payment_success_rate:5m
    expr: sum(rate(payments_processed_total{status="success"}[5m])) / sum(rate(payments_processed_total[5m]))

Add this file to your Prometheus configuration:

rule_files:
  - "alerts.yml"
  - "rules.yml"

Now you can use these precomputed metrics in your dashboards and alerts, which can be especially helpful for complex queries that you use frequently.

Implementing Push Gateway for Batch Jobs and Short-Lived Processes

The Pushgateway allows you to push metrics from jobs that can’t be scraped, such as batch jobs or serverless functions. Let’s add a Pushgateway to our docker-compose.yml:

services:
  # ... other services ...

  pushgateway:
    image: prom/pushgateway
    ports:
      - 9091:9091

Now, you can push metrics to the Pushgateway from your batch jobs or short-lived processes. Here’s an example using the Go client:

import (
    "log"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/push"
)

func runBatchJob() {
    // Define a counter for the batch job
    batchJobCounter := prometheus.NewCounter(prometheus.CounterOpts{
        Name: "batch_job_processed_total",
        Help: "Total number of items processed by the batch job",
    })

    // Run your batch job and update the counter
    // ...

    // Push the metric to the Pushgateway
    pusher := push.New("http://pushgateway:9091", "batch_job")
    pusher.Collector(batchJobCounter)
    if err := pusher.Push(); err != nil {
        log.Printf("Could not push to Pushgateway: %v", err)
    }
}
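
If the same batch job runs on several hosts, the pushes above would overwrite one another, since the Pushgateway keys metrics by their grouping labels. A sketch of adding an instance grouping label with the push package (the helper function name is ours, not part of the original code):

import (
    "log"
    "os"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/push"
)

func pushWithInstanceLabel(batchJobCounter prometheus.Collector) {
    hostname, _ := os.Hostname()

    // Group the pushed metrics by instance so runs on different hosts
    // don't overwrite each other in the Pushgateway.
    pusher := push.New("http://pushgateway:9091", "batch_job").
        Grouping("instance", hostname).
        Collector(batchJobCounter)

    if err := pusher.Push(); err != nil {
        log.Printf("Could not push to Pushgateway: %v", err)
    }
}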

Don’t forget to add the Pushgateway as a target in your Prometheus configuration:

scrape_configs:
  # ... other configs ...

  - job_name: 'pushgateway'
    static_configs:
      - targets: ['pushgateway:9091']

Federated Prometheus Setups for Large-Scale Systems

For large-scale systems, you might need to set up Prometheus federation, where one Prometheus server scrapes data from other Prometheus servers. This allows you to aggregate metrics from multiple Prometheus instances.

Here’s an example configuration for a federated Prometheus setup:

scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="order_processing_api"}'
        - '{job="postgres_exporter"}'
    static_configs:
      - targets:
        - 'prometheus-1:9090'
        - 'prometheus-2:9090'

This configuration allows a higher-level Prometheus server to scrape specific metrics from other Prometheus servers.

Using Exemplars for Tracing Integration

Exemplars allow you to link metrics to trace data, providing a way to drill down from a high-level metric to a specific trace. This is particularly useful when integrating Prometheus with distributed tracing systems like Jaeger or Zipkin.

To use exemplars, you need to enable exemplar storage in Prometheus. This is a feature flag rather than a prometheus.yml setting, so add it to the prometheus service's command in docker-compose.yml:

services:
  prometheus:
    # ... other settings ...
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--enable-feature=exemplar-storage'

Then, when instrumenting your code, you can add exemplars to your metrics:

import (
    "context"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    orderProcessingDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "order_processing_duration_seconds",
            Help: "Duration of order processing in seconds",
            Buckets: prometheus.DefBuckets,
        },
        []string{"status"},
    )
)

func processOrder(ctx context.Context, order Order) {
    start := time.Now()
    // Process the order...
    duration := time.Since(start)

    // Attach the current trace ID as an exemplar to the observation
    orderProcessingDuration.WithLabelValues(order.Status).(prometheus.ExemplarObserver).ObserveWithExemplar(
        duration.Seconds(),
        prometheus.Labels{"traceID": getCurrentTraceID(ctx)},
    )
}

This allows you to link from a spike in order processing duration directly to the trace of a slow order, greatly aiding in debugging and performance analysis.
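
The getCurrentTraceID(ctx) helper above is left undefined. If you're using OpenTelemetry for tracing (which we'll cover in the next part), a sketch of pulling the active trace ID out of the request context might look like this (an assumption, not an API our system already has):

import (
    "context"

    "go.opentelemetry.io/otel/trace"
)

// getCurrentTraceID returns the active trace ID from the context,
// or an empty string if no trace is associated with it.
func getCurrentTraceID(ctx context.Context) string {
    spanCtx := trace.SpanContextFromContext(ctx)
    if !spanCtx.HasTraceID() {
        return ""
    }
    return spanCtx.TraceID().String()
}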

10. Testing and Validation

Ensuring the reliability of your monitoring system is crucial. Let’s explore some strategies for testing and validating our Prometheus setup.

Unit Testing Metric Instrumentation

When unit testing your Go code, you can use the prometheus/testutil package to verify that your metrics are being updated correctly:

import (
    "strings"
    "testing"

    "github.com/prometheus/client_golang/prometheus/testutil"
)

func TestOrderProcessing(t *testing.T) {
    // Process an order
    processOrder(Order{ID: 1, Status: "completed"})

    // Check if the metric was updated
    expected := `
        # HELP order_processing_duration_seconds Duration of order processing in seconds
        # TYPE order_processing_duration_seconds histogram
        order_processing_duration_seconds_bucket{status="completed",le="0.005"} 1
        order_processing_duration_seconds_bucket{status="completed",le="0.01"} 1
        # ... other buckets ...
        order_processing_duration_seconds_sum{status="completed"} 0.001
        order_processing_duration_seconds_count{status="completed"} 1
    `
    if err := testutil.CollectAndCompare(orderProcessingDuration, strings.NewReader(expected)); err != nil {
        t.Errorf("unexpected collecting result:\n%s", err)
    }
}

Integration Testing for Prometheus Scraping

To test that Prometheus is correctly scraping your metrics, you can set up an integration test that starts your application, waits for Prometheus to scrape it, and then queries Prometheus to verify the metrics:

import (
    "context"
    "testing"
    "time"

    "github.com/prometheus/client_golang/api"
    v1 "github.com/prometheus/client_golang/api/prometheus/v1"
    "github.com/prometheus/common/model"
)

func TestPrometheusIntegration(t *testing.T) {
    // Start your application
    go startApp()

    // Wait for Prometheus to scrape (adjust the sleep time as needed)
    time.Sleep(30 * time.Second)

    // Query Prometheus
    client, err := api.NewClient(api.Config{
        Address: "http://localhost:9090",
    })
    if err != nil {
        t.Fatalf("Error creating client: %v", err)
    }

    v1api := v1.NewAPI(client)
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    result, warnings, err := v1api.Query(ctx, "order_processing_duration_seconds_count", time.Now())
    if err != nil {
        t.Fatalf("Error querying Prometheus: %v", err)
    }
    if len(warnings) > 0 {
        t.Logf("Warnings: %v", warnings)
    }

    // Check the result
    if result.(model.Vector).Len() == 0 {
        t.Errorf("Expected non-empty result")
    }
}

Load Testing and Observing Metrics Under Stress

It’s important to verify that your monitoring system performs well under load. You can use tools like hey or vegeta to generate load on your system while observing your metrics:

hey -n 10000 -c 100 http://localhost:8080/orders

While the load test is running, observe your Grafana dashboards and check that your metrics are updating as expected and that Prometheus is able to keep up with the increased load.

Validating Alerting Rules and Notification Channels

To test your alerting rules, you can temporarily adjust the thresholds to trigger alerts, or use Prometheus’s API to manually fire alerts:

curl -H "Content-Type: application/json" -d '{
  "alerts": [
    {
      "labels": {
        "alertname": "HighOrderProcessingErrorRate",
        "severity": "critical"
      },
      "annotations": {
        "summary": "High order processing error rate"
      }
    }
  ]
}' http://localhost:9093/api/v1/alerts

This will send a test alert to your Alertmanager, allowing you to verify that your notification channels are working correctly.

11. Challenges and Considerations

As you implement and scale your monitoring system, keep these challenges and considerations in mind:

Managing Cardinality in High-Dimensional Data

High cardinality can lead to performance issues in Prometheus. Be cautious when adding labels to metrics, especially labels with many possible values (like user IDs or IP addresses). Instead, consider using histogram metrics or reducing the cardinality by grouping similar values.
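
As a small, illustrative example, instead of labeling our HTTP counter with every raw status code (or worse, a user ID), we could collapse status codes into their classes before using them as a label value:

import "fmt"

// statusClass collapses individual HTTP status codes into five stable
// values (1xx through 5xx), keeping the label's cardinality low.
func statusClass(code int) string {
    return fmt.Sprintf("%dxx", code/100)
}

// In the metrics middleware from earlier:
//   httpRequestsTotal.WithLabelValues(c.Request.Method, c.FullPath(), statusClass(c.Writer.Status())).Inc()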

Scaling Prometheus for Large-Scale Systems

For large-scale systems, consider:

  • Using the Pushgateway for batch jobs
  • Implementing federation for large-scale setups
  • Using remote storage solutions for long-term storage of metrics

Ensuring Monitoring System Reliability and Availability

Your monitoring system is critical infrastructure. Consider:

  • Implementing high availability for Prometheus and Alertmanager
  • Monitoring your monitoring system (meta-monitoring)
  • Regularly backing up your Prometheus data

Security Considerations for Metrics and Alerting

Ensure that:

  • Access to Prometheus and Grafana is properly secured
  • Sensitive information is not exposed in metrics or alerts
  • TLS is used for all communications in your monitoring stack

Dealing with Transient Issues and Flapping Alerts

To reduce alert noise:

  • Use appropriate time windows in your alert rules
  • Implement alert grouping in Alertmanager
  • Consider using alert inhibition for related alerts

12. Next Steps and Preview of Part 5

In this post, we’ve covered comprehensive monitoring and alerting for our order processing system using Prometheus and Grafana. We’ve set up custom metrics, created informative dashboards, implemented alerting, and explored advanced techniques and considerations.

In the next part of our series, we’ll focus on distributed tracing and logging. We’ll cover:

  1. Implementing distributed tracing with OpenTelemetry
  2. Setting up centralized logging with the ELK stack
  3. Correlating logs, traces, and metrics for effective debugging
  4. Implementing log aggregation and analysis
  5. Best practices for logging in a microservices architecture

Stay tuned as we continue to enhance our order processing system, focusing next on gaining deeper insights into our distributed system’s behavior and performance!


Need Help?

Are you facing challenging problems, or need an external perspective on a new idea or project? I can help! Whether you're looking to build a technology proof of concept before making a larger investment, or you need guidance on difficult issues, I'm here to assist.

Services Offered:

  • Problem-Solving: Tackling complex issues with innovative solutions.
  • Consultation: Providing expert advice and fresh viewpoints on your projects.
  • Proof of Concept: Developing preliminary models to test and validate your ideas.

If you're interested in working with me, please reach out via email at [email protected].

Let's turn your challenges into opportunities!
