Practical Monitoring and Logging - From Scratch

By Freecoderteam

Oct 04, 2025

Practical Monitoring and Logging: From Scratch

Monitoring and logging are critical components of modern software development and infrastructure management. They provide insight into system behavior, help identify issues, and keep applications stable and performant. This guide covers practical monitoring and logging techniques, starting with the basics and working up to a running example. It is written for developers, DevOps engineers, and system administrators who want to build an effective monitoring and logging pipeline.


Table of Contents

  1. Introduction to Monitoring and Logging
  2. Key Concepts
    • Metrics
    • Logs
    • Tracing
  3. Choosing the Right Tools
    • Open-source vs. Commercial Solutions
    • Common Tools
  4. Setting Up Logging
    • Log Format and Structure
    • Log Levels
    • Example: Writing Logs in Python
  5. Configuring Monitoring
    • Metrics Collection
    • Alerting and Notifications
    • Example: Monitoring a Web Server
  6. Practical Insights and Best Practices
    • Centralized Logging
    • Monitoring for SRE (Site Reliability Engineering)
    • Security Considerations
  7. Conclusion
  8. Further Reading

1. Introduction to Monitoring and Logging

Monitoring and logging are essential for understanding and managing the health of your systems. While they are closely related, they serve distinct purposes:

  • Monitoring involves collecting and analyzing metrics (e.g., CPU usage, memory consumption, response times) to assess system performance and detect anomalies.
  • Logging captures detailed event records (e.g., errors, user actions, system states) to provide context and traceability.

Together, monitoring and logging empower teams to troubleshoot issues quickly, optimize performance, and ensure high availability.


2. Key Concepts

Metrics

Metrics are quantitative measurements that describe the state or behavior of a system. Common examples include:

  • CPU utilization
  • Memory usage
  • Disk I/O
  • Network traffic
  • Application-specific metrics (e.g., request latency, error rates)
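
As a rough illustration of the application-specific metrics above, the following sketch derives an error rate and latency percentiles from raw request data. The sample values and the helper names error_rate and percentile are made up for this example, and the percentile uses the simple nearest-rank method rather than whatever a particular monitoring tool implements.

```python
import math

def error_rate(status_codes):
    """Fraction of responses with a 5xx status code."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes)

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical data: request latencies in milliseconds and HTTP status codes
latencies_ms = [120, 95, 310, 101, 87, 450, 99, 105, 98, 102]
statuses = [200, 200, 500, 200, 200, 503, 200, 200, 404, 200]

print(f"error rate: {error_rate(statuses):.0%}")
print(f"p50 latency: {percentile(latencies_ms, 50)} ms")
print(f"p90 latency: {percentile(latencies_ms, 90)} ms")
```

Percentiles are usually more informative than averages for latency, because a few slow requests can hide behind a healthy-looking mean.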

Logs

Logs are textual records of events or actions that occur in a system. They provide context and detail, and are often used for debugging and forensic analysis. Logs can be:

  • Application logs (e.g., errors, warnings, debug messages)
  • System logs (e.g., operating system events)
  • Audit logs (e.g., user activities)

Tracing

Tracing is the process of tracking the flow of a request or transaction through a system. It helps understand how different components interact and identify bottlenecks. Distributed tracing is particularly useful in microservices architectures.
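
The idea can be sketched in a single process, assuming a trace ID is assigned when a request arrives and attached to every timed step (span) that runs on its behalf. The names Span, handle_request, and query_database are hypothetical, and the in-memory list stands in for a real tracing backend. A distributed setup would also pass the trace ID between services, which this sketch does not attempt.

```python
import contextvars
import time
import uuid

# Holds the trace ID of the request currently being handled
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)
spans = []  # in-memory stand-in for a tracing backend

class Span:
    """Context manager that records how long a named step took."""

    def __init__(self, name):
        self.name = name

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        spans.append({
            "trace_id": current_trace_id.get(),
            "name": self.name,
            "duration_s": time.perf_counter() - self.start,
        })
        return False  # do not swallow exceptions

def query_database():
    with Span("query_database"):
        time.sleep(0.01)  # simulated work

def handle_request():
    current_trace_id.set(uuid.uuid4().hex)  # one ID per request
    with Span("handle_request"):
        query_database()

handle_request()
for s in spans:
    print(s["trace_id"], s["name"], f"{s['duration_s'] * 1000:.1f} ms")
```

Both spans share one trace ID, so the time spent in the database step can be compared with the total time for the request.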


3. Choosing the Right Tools

Open-source vs. Commercial Solutions

  • Open-source tools (e.g., Prometheus, Grafana, ELK Stack) offer flexibility and community support but require more setup and maintenance.
  • Commercial tools (e.g., Datadog, New Relic) provide ease of use and advanced features but come with licensing costs.

Common Tools

Here are some popular tools for monitoring and logging:

  • Prometheus: An open-source monitoring and alerting toolkit.
  • Grafana: A visualization platform for metrics.
  • ELK Stack (Elasticsearch, Logstash, Kibana): A popular log aggregation and visualization stack.
  • Splunk: A commercial log management and analytics platform.
  • New Relic: A cloud-based application performance monitoring (APM) tool.
  • Datadog: A comprehensive monitoring platform for infrastructure and applications.

4. Setting Up Logging

Log Format and Structure

A well-structured log format is crucial for readability and analysis. Here's an example of a structured JSON log:

{
  "timestamp": "2023-10-05T10:15:30Z",
  "level": "ERROR",
  "service": "my-app",
  "message": "Failed to connect to database",
  "error": {
    "code": 500,
    "details": "Connection refused"
  }
}
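
One benefit of this structure is that logs can be filtered by field instead of by text search. The sketch below assumes one JSON object per line, reusing the fields from the example above. The second and third input lines are invented for illustration, and parse_logs is a hypothetical helper.

```python
import json

raw_lines = [
    '{"timestamp": "2023-10-05T10:15:30Z", "level": "ERROR", "service": "my-app", '
    '"message": "Failed to connect to database", '
    '"error": {"code": 500, "details": "Connection refused"}}',
    '{"timestamp": "2023-10-05T10:15:31Z", "level": "INFO", "service": "my-app", '
    '"message": "Retrying connection"}',
    'not json at all',
]

def parse_logs(lines):
    """Yield parsed entries, skipping lines that are not valid JSON objects."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(entry, dict):
            yield entry

# Select entries by field rather than by substring matching
errors = [e for e in parse_logs(raw_lines) if e.get("level") == "ERROR"]
for e in errors:
    print(e["timestamp"], e["message"], e.get("error", {}).get("details"))
```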

Log Levels

Log levels help categorize the severity of events. Common levels include:

  • DEBUG: Detailed information for developers.
  • INFO: General operational information.
  • WARNING: Indications of potential issues.
  • ERROR: Significant errors that affect functionality.
  • CRITICAL: Severe errors that may cause downtime.
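
To show how levels act as a filter, here is a small sketch using Python's built-in logging module. The logger name levels_demo and the messages are placeholders. With the threshold set to WARNING, the DEBUG and INFO calls are dropped.

```python
import io
import logging

stream = io.StringIO()  # capture output in memory instead of writing to stderr
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

log = logging.getLogger("levels_demo")  # hypothetical logger name
log.setLevel(logging.WARNING)           # anything below WARNING is discarded
log.addHandler(handler)
log.propagate = False                   # keep the demo isolated from the root logger

log.debug("cache miss for key user:42")
log.info("request served")
log.warning("disk usage at 85%")
log.error("payment provider timed out")
log.critical("database unreachable")

output = stream.getvalue()
print(output, end="")
```

A common arrangement is DEBUG in development and INFO or WARNING in production, which keeps log volume manageable.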

Example: Writing Logs in Python

Here's how to set up structured logging in Python using the built-in logging module:

import logging
import json

# Configure the logger
logger = logging.getLogger("my_app")
logger.setLevel(logging.DEBUG)

# Create a JSON formatter
class JsonFormatter(logging.Formatter):
    def format(self, record):
        log_entry = {
            # record.asctime is only set by the base Formatter.format(), so call formatTime() here
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "my_app",
            "message": record.getMessage(),
        }
        if record.exc_info:
            log_entry["error"] = {
                "details": str(record.exc_info[1]),
            }
        return json.dumps(log_entry)

# Add a console handler
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

# Log a message
try:
    1 / 0
except ZeroDivisionError:
    logger.error("Failed to perform division", exc_info=True)

Output (pretty-printed here for readability; json.dumps emits it on a single line):

{
  "timestamp": "2023-10-05 10:15:30,123",
  "level": "ERROR",
  "service": "my_app",
  "message": "Failed to perform division",
  "error": {
    "details": "division by zero"
  }
}

5. Configuring Monitoring

Metrics Collection

Metrics are typically collected using agents or exporters that continuously measure system and application performance. For example, Prometheus periodically scrapes HTTP endpoints exposed by exporters, which translate the state of a host, database, or application into metrics.
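
At its simplest, a collector is a function that takes a snapshot of some system state, called on a fixed interval. The sketch below uses only the Python standard library. The function names and metric names are illustrative and do not follow any particular agent's conventions.

```python
import shutil
import time

def collect_disk_metrics(path="/"):
    """Return a snapshot of disk usage for `path` as metric name -> value."""
    usage = shutil.disk_usage(path)
    return {
        "disk_total_bytes": usage.total,
        "disk_used_bytes": usage.used,
        "disk_used_ratio": usage.used / usage.total if usage.total else 0.0,
    }

def sample(collector, count, interval_s):
    """Call `collector` `count` times, `interval_s` seconds apart, adding a timestamp."""
    samples = []
    for i in range(count):
        samples.append({"time": time.time(), **collector()})
        if i < count - 1:
            time.sleep(interval_s)
    return samples

for s in sample(collect_disk_metrics, count=3, interval_s=0.1):
    print(f"{s['time']:.0f} used={s['disk_used_ratio']:.1%}")
```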

Alerting and Notifications

Alerting systems notify you when metrics exceed predefined thresholds. For example:

  • If CPU usage exceeds 80%, send an alert.
  • If response times exceed 500ms, trigger an incident.
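
A threshold rule like these can be sketched as a small function. This version adds one assumption that the text above does not state: the alert fires only after several consecutive readings exceed the threshold, a common way to avoid alerting on a single spike. The function name and sample readings are hypothetical.

```python
def should_alert(samples, threshold, min_consecutive):
    """True if the most recent `min_consecutive` samples all exceed `threshold`."""
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    if len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])

# Hypothetical readings, newest last
cpu_percent = [72, 85, 78, 83, 88, 91]
response_ms = [320, 510, 480, 530, 490]

print(should_alert(cpu_percent, threshold=80, min_consecutive=3))   # True: 83, 88, 91
print(should_alert(response_ms, threshold=500, min_consecutive=2))  # False: 490 is below 500
```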

Example: Monitoring a Web Server

Here's how to monitor a simple web server using Prometheus and Grafana:

Step 1: Install Prometheus

  1. Download and install Prometheus:
    wget https://github.com/prometheus/prometheus/releases/download/v2.44.0/prometheus-2.44.0.linux-amd64.tar.gz
    tar xvfz prometheus-2.44.0.linux-amd64.tar.gz
    cd prometheus-2.44.0.linux-amd64
    
  2. Configure Prometheus to scrape metrics from your web server by editing prometheus.yml, then start it with ./prometheus --config.file=prometheus.yml:
    global:
      scrape_interval: 15s
    
    scrape_configs:
      - job_name: 'web-server'
        static_configs:
          - targets: ['localhost:8080']
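
The scrape configuration above assumes that something on localhost:8080 serves metrics at /metrics, the default path Prometheus scrapes. Below is a minimal sketch of such a server using only the Python standard library. It hand-writes a single counter in the Prometheus text exposition format. The metric name http_requests_total and the helpers render_metrics and make_server are choices made for this example, and a real application would more likely use an official client library.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

request_count = 0
count_lock = threading.Lock()

def render_metrics(total):
    """Format a single counter in the Prometheus text exposition format."""
    return (
        "# HELP http_requests_total Total HTTP requests handled.\n"
        "# TYPE http_requests_total counter\n"
        f"http_requests_total {total}\n"
    )

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        global request_count
        if self.path == "/metrics":
            # Scrapes read the counter but are not counted themselves
            with count_lock:
                body = render_metrics(request_count).encode()
            content_type = "text/plain; version=0.0.4"
        else:
            with count_lock:
                request_count += 1
            body = b"hello\n"
            content_type = "text/plain"
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request access logs for this demo

def make_server(port=8080):
    """Build the server; port 8080 matches the scrape target above."""
    return ThreadingHTTPServer(("localhost", port), Handler)

# To run it: make_server().serve_forever()
```

Once the server is running, opening http://localhost:8080/metrics in a browser should show the counter, and Prometheus will pick it up on its next scrape.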
    

Step 2: Install Grafana

  1. Download and install Grafana:

    wget https://dl.grafana.com/oss/release/grafana-9.3.5.linux-amd64.tar.gz
    tar xvfz grafana-9.3.5.linux-amd64.tar.gz
    cd grafana-9.3.5
    ./bin/grafana-server web
    
  2. Add Prometheus as a data source in Grafana and create dashboards to visualize metrics like request rates, response times, and error rates.


6. Practical Insights and Best Practices

Centralized Logging

Centralized logging involves collecting logs from multiple sources into a single location. This simplifies log management and analysis. Tools like ELK Stack or Splunk can help aggregate logs from diverse systems.
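
The core operation of a centralized log store, combining entries from several services into one timeline, can be sketched in a few lines. The sketch assumes each source is already sorted and uses ISO 8601 UTC timestamps in the same format, so that string order matches time order. The log entries and the helper merge_logs are invented for illustration.

```python
import heapq

# Hypothetical per-service logs, each already sorted by timestamp
web_logs = [
    {"timestamp": "2023-10-05T10:15:29Z", "service": "web", "message": "GET /orders"},
    {"timestamp": "2023-10-05T10:15:31Z", "service": "web", "message": "500 returned"},
]
db_logs = [
    {"timestamp": "2023-10-05T10:15:30Z", "service": "db", "message": "Connection refused"},
]

def merge_logs(*sources):
    """Merge sorted log streams into one stream ordered by timestamp."""
    return heapq.merge(*sources, key=lambda entry: entry["timestamp"])

timeline = list(merge_logs(web_logs, db_logs))
for entry in timeline:
    print(entry["timestamp"], entry["service"], entry["message"])
```

Seen together, the database error sits between the incoming request and the 500 response, a connection that is hard to spot when each service's logs live on a different machine.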

Monitoring for SRE

Site Reliability Engineering (SRE) emphasizes monitoring as a core practice. Key principles include:

  • Monitor what matters: Focus on metrics that directly impact user experience.
  • Set realistic thresholds: Avoid alert fatigue by tuning thresholds based on historical data.
  • Monitor dependencies: Keep an eye on external services that your system relies on.
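
One possible way to tune a threshold from historical data is to place it a few standard deviations above the historical mean. This is only one heuristic among several, and the helper name, the multiplier of 3, and the sample latencies below are assumptions made for this sketch.

```python
import statistics

def suggest_threshold(history, sigmas=3.0):
    """Suggest an alert threshold: historical mean plus `sigmas` standard deviations."""
    if not history:
        raise ValueError("history must not be empty")
    mean = statistics.fmean(history)
    spread = statistics.pstdev(history)
    return mean + sigmas * spread

# Hypothetical response times in milliseconds from a normal week
history_ms = [90, 110, 100, 95, 105]

threshold = suggest_threshold(history_ms)
print(f"alert when response time exceeds {threshold:.0f} ms")
```

A threshold derived this way should still be reviewed against what users actually notice, since a statistically unusual value is not always a user-facing problem.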

Security Considerations

  • Log rotation: Regularly rotate logs to prevent storage overflow.
  • Encryption: Encrypt sensitive logs to protect against data breaches.
  • Access control: Restrict log access to authorized personnel only.
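
Some of these points can be sketched with Python's built-in logging module: RotatingFileHandler caps the size of each log file, a filter masks sensitive values before they are written, and file permissions restrict who can read the log. The redaction pattern, the logger name secure_demo, and the temporary log path are assumptions for this example. Encryption at rest is not covered here.

```python
import logging
import logging.handlers
import os
import re
import tempfile

class RedactFilter(logging.Filter):
    """Mask values of sensitive key=value pairs before the record is written."""

    PATTERN = re.compile(r"(password|token)=\S+", re.IGNORECASE)

    def filter(self, record):
        # Render the message first, then mask it and drop the now-unused args
        record.msg = self.PATTERN.sub(r"\1=[REDACTED]", record.getMessage())
        record.args = ()
        return True

log_path = os.path.join(tempfile.mkdtemp(), "app.log")  # hypothetical location

# Rotate at roughly 1 MB and keep at most 5 old files
handler = logging.handlers.RotatingFileHandler(log_path, maxBytes=1_000_000, backupCount=5)
handler.addFilter(RedactFilter())
os.chmod(log_path, 0o600)  # owner-only access (takes full effect on POSIX systems)

secure_log = logging.getLogger("secure_demo")
secure_log.setLevel(logging.INFO)
secure_log.addHandler(handler)
secure_log.propagate = False

secure_log.info("login attempt user=%s password=%s", "alice", "hunter2")
handler.flush()

with open(log_path) as f:
    print(f.read(), end="")
```

Redacting at the handler means the secret never reaches disk, which is safer than scrubbing log files after the fact.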

7. Conclusion

Monitoring and logging are foundational practices for maintaining healthy and reliable systems. By understanding key concepts, choosing the right tools, and implementing best practices, you can build a robust monitoring and logging pipeline. Whether you're starting from scratch or refining your existing setup, the principles discussed here will help you make informed decisions.


8. Further Reading

By following the steps and best practices outlined in this guide, you'll be well-equipped to implement effective monitoring and logging solutions for your applications and infrastructure. Happy monitoring! 🚀

