Building a Cost-Effective Logging Platform using ClickHouse for Petabyte Scale

OBJECTIVES/GOAL

The goal of this project was to revamp the logging infrastructure of the customer’s internal microservices and monolithic applications to handle the exponential growth in log volumes efficiently. The objectives were to reduce operational overhead, optimize costs, improve system reliability, and ensure real-time access to log data for timely decision-making.

CHALLENGES

The customer encountered several challenges with their existing logging setup:

  • Managing and scaling Elasticsearch clusters became increasingly complex and expensive as log volumes surged.
  • Over-provisioning clusters to accommodate fluctuating traffic patterns resulted in unnecessary costs without commensurate performance gains.
  • The semi-structured nature of logs made schema design and query optimization challenging.
  • Ensuring real-time data ingestion and rapid query performance to support timely insights was difficult.
  • Balancing the need for high performance with cost-effective infrastructure design was difficult.

ACCOMPLISHMENTS

The implementation of ClickHouse as the logging platform led to significant achievements:

  • Real-time data ingestion with an ingestion lag of less than 5 seconds ensured timely availability of logs for analysis and decision-making.
  • Fast queries, with a P99 query time of 10 seconds, accelerated data processing and delivered prompt insights.
  • Customized solutions, including schema design, a custom SDK for structured logging, and data tiering, addressed the challenges of managing semi-structured logs and optimized performance (a sketch of the log record such an SDK might emit follows this list).
  • The platform’s auditing capabilities and cost-effectiveness offered potential savings of over a million dollars per year compared to the previous setup.
  • Enhanced security measures, including Google Authentication, table-level access control, and query auditing, ensured only authorized access to sensitive log data, improving overall system security.
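
To make the structured-logging work concrete, the sketch below shows one plausible shape for a log record emitted by such an SDK. The package, type, and field names are illustrative assumptions rather than the customer’s actual SDK; the idea is that fixed, typed fields cover the common query dimensions while a key/value map absorbs everything else, keeping the ClickHouse schema stable as services evolve.

    package logsdk

    import "time"

    // LogEntry is a hypothetical shape for a structured log record emitted by
    // the SDK. Fixed, typed fields cover the common query dimensions; all other
    // context goes into Attributes so the table schema stays stable.
    type LogEntry struct {
        Timestamp  time.Time         // event time; stored as DateTime64 in ClickHouse
        Service    string            // emitting service name (low cardinality)
        Level      string            // DEBUG, INFO, WARN, ERROR
        Message    string            // free-text log line
        TraceID    string            // correlation ID for tracing a request across services
        Attributes map[string]string // arbitrary key/value context (Map(String, String))
    }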

TECHNOLOGIES USED

  • ClickHouse: Chosen as the primary database for its ability to handle terabytes of data with low latency, its scalability, and its cost-effectiveness.
  • AWS EC2: Used to host the ClickHouse clusters, with 10 m6g.16xlarge nodes sized to handle peak traffic and ensure high availability.
  • Golang: Custom Golang workers were developed for efficient log ingestion into ClickHouse, leveraging batch processing and spot instances for cost optimization (a worker sketch appears after this list).
  • Prometheus and Grafana: Used for monitoring ClickHouse metrics, visualizing performance data, and setting alerts to ensure system reliability.
  • Google Authentication: Employed for access control, ensuring only authorized individuals could query the logs.
  • Table-Level Access Control and Query Auditing: Implemented for enhanced security and compliance with data privacy regulations (see the grant-and-audit sketch after this list).
  • TTL-based Data Tiering: Managed the data lifecycle by moving older data to cold storage after 24 hours and deleting it after 3 months to optimize storage costs (the corresponding TTL clauses appear in the schema sketch after this list).
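
The TTL-based tiering maps directly onto ClickHouse table TTL clauses. The sketch below, using the open-source clickhouse-go v2 client, creates a hypothetical logs.app_logs table whose parts move to a 'cold' volume after 24 hours and whose rows are deleted after 3 months. The database, table, and column names, as well as the 'tiered' storage policy and its 'cold' volume, are assumptions that would need to match the real deployment.

    package main

    import (
        "context"
        "log"

        "github.com/ClickHouse/clickhouse-go/v2"
    )

    // createLogsTable sketches a log table whose older parts move to a 'cold'
    // volume after 24 hours and whose rows are deleted after 3 months. The
    // 'tiered' storage policy and 'cold' volume are assumed to be defined in
    // the server's storage configuration.
    const createLogsTable = `
    CREATE TABLE IF NOT EXISTS logs.app_logs
    (
        timestamp  DateTime64(3),
        service    LowCardinality(String),
        level      LowCardinality(String),
        message    String,
        trace_id   String,
        attributes Map(String, String)
    )
    ENGINE = MergeTree
    PARTITION BY toDate(timestamp)
    ORDER BY (service, level, timestamp)
    TTL toDateTime(timestamp) + INTERVAL 1 DAY TO VOLUME 'cold',
        toDateTime(timestamp) + INTERVAL 3 MONTH DELETE
    SETTINGS storage_policy = 'tiered'`

    func main() {
        conn, err := clickhouse.Open(&clickhouse.Options{Addr: []string{"127.0.0.1:9000"}})
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        if err := conn.Exec(context.Background(), createLogsTable); err != nil {
            log.Fatal(err)
        }
    }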
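
The Golang ingestion workers can be approximated with a small batching loop. The sketch below assumes the same clickhouse-go v2 client and the hypothetical logs.app_logs table from the previous sketch; a channel stands in for whatever queue or stream the real workers consume from, and batches are flushed either when they reach a size limit or when a timer fires, which keeps the insert rate low and the ClickHouse parts healthy.

    package ingest

    import (
        "context"
        "log"
        "time"

        "github.com/ClickHouse/clickhouse-go/v2"
    )

    // logEntry mirrors the columns of the hypothetical logs.app_logs table.
    type logEntry struct {
        Timestamp  time.Time
        Service    string
        Level      string
        Message    string
        TraceID    string
        Attributes map[string]string
    }

    // runWorker drains entries from in and writes them to ClickHouse in batches,
    // flushing when maxBatch entries have accumulated or when flushEvery elapses.
    func runWorker(ctx context.Context, in <-chan logEntry, maxBatch int, flushEvery time.Duration) error {
        conn, err := clickhouse.Open(&clickhouse.Options{Addr: []string{"127.0.0.1:9000"}})
        if err != nil {
            return err
        }
        defer conn.Close()

        ticker := time.NewTicker(flushEvery)
        defer ticker.Stop()

        pending := make([]logEntry, 0, maxBatch)

        flush := func() error {
            if len(pending) == 0 {
                return nil
            }
            batch, err := conn.PrepareBatch(ctx, "INSERT INTO logs.app_logs")
            if err != nil {
                return err
            }
            for _, e := range pending {
                if err := batch.Append(e.Timestamp, e.Service, e.Level, e.Message, e.TraceID, e.Attributes); err != nil {
                    return err
                }
            }
            if err := batch.Send(); err != nil {
                return err // keep pending entries; they are retried on the next flush
            }
            pending = pending[:0]
            return nil
        }

        for {
            select {
            case <-ctx.Done():
                return flush() // final flush on shutdown
            case e := <-in:
                pending = append(pending, e)
                if len(pending) >= maxBatch {
                    if err := flush(); err != nil {
                        log.Printf("flush failed, will retry: %v", err)
                    }
                }
            case <-ticker.C:
                if err := flush(); err != nil {
                    log.Printf("flush failed, will retry: %v", err)
                }
            }
        }
    }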
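
Table-level access control and query auditing are native ClickHouse features: access is restricted with roles and GRANT statements, and executed queries are recorded in the system.query_log table. The sketch below shows one way the setup could look; the log_reader role, analyst user, password, and table name are illustrative placeholders, and the Google Authentication layer in front of the query interface is not shown.

    package main

    import (
        "context"
        "fmt"
        "log"
        "time"

        "github.com/ClickHouse/clickhouse-go/v2"
    )

    func main() {
        ctx := context.Background()
        conn, err := clickhouse.Open(&clickhouse.Options{Addr: []string{"127.0.0.1:9000"}})
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        // Table-level access control: a read-only role scoped to the log table.
        // Role, user, and table names are illustrative placeholders.
        for _, stmt := range []string{
            "CREATE ROLE IF NOT EXISTS log_reader",
            "GRANT SELECT ON logs.app_logs TO log_reader",
            "CREATE USER IF NOT EXISTS analyst IDENTIFIED WITH sha256_password BY 'change-me'",
            "GRANT log_reader TO analyst",
        } {
            if err := conn.Exec(ctx, stmt); err != nil {
                log.Fatal(err)
            }
        }

        // Query auditing: ClickHouse records executed queries in system.query_log,
        // which can be polled or exported to build an audit trail.
        rows, err := conn.Query(ctx, `
            SELECT event_time, user, query
            FROM system.query_log
            WHERE type = 'QueryFinish' AND event_time > now() - INTERVAL 1 HOUR
            ORDER BY event_time DESC
            LIMIT 20`)
        if err != nil {
            log.Fatal(err)
        }
        defer rows.Close()

        for rows.Next() {
            var ts time.Time
            var user, query string
            if err := rows.Scan(&ts, &user, &query); err != nil {
                log.Fatal(err)
            }
            fmt.Printf("%s  %s  %s\n", ts.Format(time.RFC3339), user, query)
        }
    }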