As cloud costs continue to rise, optimizing infrastructure components like log collection can yield significant savings. Traditional logging solutions like Fluentd, while powerful, can introduce high resource consumption and operational overhead. In this post, we explore migrating from Fluentd to Vector.dev, a lightweight and efficient alternative, to reduce costs while maintaining a robust logging pipeline.
Understanding the Cost Implications of Fluentd
Fluentd is widely used for log collection but comes with several notable cost factors:
- Resource Consumption: Fluentd’s Ruby-based architecture leads to higher CPU and memory usage, especially in larger environments. This increased resource consumption can drive up infrastructure costs, particularly when running multiple Fluentd instances.
- Operational Overhead: Managing Fluentd configurations and scaling can become complex. Ensuring high availability and performance often requires careful monitoring, adjustments, and scaling strategies, which can increase operational costs and effort.
- Documentation Confusion: Fluentd’s documentation around buffering and available sink options can sometimes be unclear, leading to challenges in proper configuration. Misunderstandings can lead to inefficient setups or additional troubleshooting efforts.
- Frequent Container Restarts: A common operational task is restarting Fluentd containers to address issues with log collection. This step, while necessary to maintain uptime, adds operational overhead and can be disruptive, requiring additional monitoring and manual intervention.
Why Choose Vector?
Vector presents several advantages over Fluentd:
- Performance: Built with Rust, Vector provides superior performance with significantly lower CPU and memory consumption compared to Fluentd’s Ruby-based architecture. This leads to better resource utilization, especially in large-scale environments.
- Efficiency: Vector is designed for high-efficiency log processing, minimizing latency and improving throughput. This enables faster log collection, transformation, and forwarding, reducing the time needed for log data to reach your analytics platforms.
- Flexibility: Vector supports a wide range of log sources and sinks, making it highly adaptable to various infrastructure setups. It also offers rich transformation capabilities, allowing you to easily manipulate and filter log data before forwarding it to its final destination.
Current Log Collection Architecture
Before the migration, our log collection setup included:
- Pods Log Collector: FluentBit agents were deployed on all Kubernetes nodes to collect logs from pods. The cost of using FluentBit was comparable to Vector; the only reason for migrating was to unify the log collection platform.
- Legacy Java Logs: Application logs were sent using Logback and LogstashTcpSocketAppender, with nine application collectors running on m5.large instances in AWS.
- Aggregator: All logs were forwarded to aggregators, which sent logs in bulk to an Elasticsearch endpoint for indexing and analysis. Five aggregators ran on r5.large instances.
Although QA and development environments ran on spot instances, production was fully on-demand, costing approximately $550 per month with an EC2 Savings Plan.
Planning the Migration
A smooth migration requires:
- Assessing the Fluentd Setup: Identifying all logging sources and destinations.
- TCP Collector: Logs in JSON format, with multiple ports featuring different filtering and index routing logic.
- HTTP Collector: Logs in JSON format.
- Filebeat Logs: Logs in Logstash format from various custom servers outside the cluster, including AWS EMR clusters.
- FluentBit and Fluentd: Logs in Fluent format, forwarded to the aggregator.
- Evaluating Compatibility: Ensuring Vector supports the required log sources and processing. Vector supports all of the formats and sources mentioned above, as well as an Elasticsearch sink.
- Migration Strategy: Phased rollout to minimize disruptions. Since Vector has its own native source and sink, it was much easier to start with the aggregator and then gradually integrate the other components. This approach avoided a Fluent → Vector-Aggregator → Elasticsearch transition and instead enabled a direct Vector → Vector-Aggregator → Elasticsearch setup from the beginning. On the agent side, we could deploy both Fluentd and Vector simultaneously, allowing specific namespaces or pods to be migrated step by step. The most challenging part was migrating all the custom TCP and HTTP collectors and translating the routing configuration logic from Fluentd to Vector.
Step-by-Step Migration Guide
- Setting Up Vector as an Aggregator: Forward logs efficiently to the Elasticsearch endpoint. A simple configuration was used to collect logs from Vector pods and push them to Elasticsearch, with a failover index name in case any data was lost during the migration:
[sources.vector]
address = "0.0.0.0:6000"
type = "vector"
version = "2"
[transforms.vector_index]
inputs = ["vector"]
source = '''
# Failover index for any event that arrives without an index name
if is_nullish(.index_name) {
  .index_name = "void"
}
'''
type = "remap"
[sinks.es_cluster]
compression = "gzip"
endpoints = ["${OPENSEARCH_ENDPOINT}"]
inputs = ["vector_index"]
type = "elasticsearch"
[sinks.es_cluster.bulk]
index = "{{ index_name }}-%Y-%m-%d"
- Deploying Vector as a DaemonSet and Configuring for Pod Logs: Deploy Vector agents alongside FluentBit in Kubernetes. Migrating services one by one was fairly straightforward: enable the configuration on Vector and disable it on FluentBit. Below is a configuration example for the nginx-ingress controller on the Vector agent, with forwarding to the Vector aggregator at the end:
[sources.nginx]
extra_field_selector = "metadata.namespace=nginx"
extra_label_selector = "app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx"
type = "kubernetes_logs"
[transforms.nginx_index]
inputs = ["nginx"]
source = '''
parsed = parse_regex(.message, r'(?P<remote_addr>.+) - (?P<remote_user>.*) \[(?P<time_local>.*)\] "(?P<method>.+) (?P<request>.*) (?P<protocol>.*)" (?P<status>\d+) (?P<body_bytes_sent>.*) "(?P<http_referer>.*)" "(?P<http_user_agent>.*)" (?P<request_length>.*) (?P<request_time>.*) \[(?P<proxy_upstream_name>.*)\] \[(?P<proxy_alternative_upstream_name>.*)\] (?P<upstream_addr>.*) (?P<upstream_response_length>.*) (?P<upstream_response_time>.*) (?P<upstream_status>.*) (?P<req_id>.*)') ??
parse_regex!(.message, r'^(?P<message>.*)')
. = merge(., parsed)
.timestamp = parse_timestamp(.time_local, "%d/%b/%Y:%H:%M:%S %z") ?? now()
.environment = "${ENV}"
.index_name = "nginx"
'''
type = "remap"
[sinks.vector_aggr]
address = "${VECTOR_AGGR_ENDPOINT}"
compression = true
inputs = [
"nginx_index",
]
type = "vector"
version = "2"
- Handling Legacy Java Logs: Configure Vector to receive logs from the LogstashTcpSocketAppender while setting .@timestamp = .timestamp to avoid breaking log alerts and dashboards in Grafana and Kibana. Below is an example of the TCP socket migration from Fluentd to Vector, with some filtering applied:
[sources.tcp_legacy]
address = "0.0.0.0:24221"
mode = "tcp"
type = "socket"
[sources.tcp_legacy.decoding]
codec = "json"
[transforms.tcp_legacy_filtered]
condition = '!includes(["DEBUG","TRACE"], .level)'
inputs = ["tcp_legacy"]
type = "filter"
[transforms.tcp_legacy_index]
inputs = ["tcp_legacy_filtered"]
source = """
.index_name = "my_tcp_legacy_index"
.environment = "${ENV}"
"""
type = "remap"
- HTTP input example:
[sources.http_legacy]
address = "0.0.0.0:24223"
path = "my.custom.path"
type = "http"
[sources.http_legacy.decoding]
codec = "json"
[transforms.http_legacy_index]
inputs = ["http_legacy"]
source = """
.index_name = "my_http_legacy_index"
.@timestamp = .timestamp
.environment = "${ENV}"
"""
type = "remap"
- Example of Logstash logs received from Filebeat on EMR clusters:
[sources.emr_filebeat]
address = "0.0.0.0:24218"
type = "logstash"
[transforms.emr_filebeat_index]
inputs = ["emr_filebeat"]
source = """
.index_name = "emr"
.@timestamp = .timestamp
"""
type = "remap"
- Integrating Vector Metrics with Prometheus and Collecting Internal Logs: Configure Vector to expose internal metrics to Prometheus for real-time monitoring and analysis, and to collect its own internal logs so they can be indexed alongside the application logs.
[sources.internal_metrics]
scrape_interval_secs = 60
type = "internal_metrics"
[sinks.prom_exporter]
address = "0.0.0.0:9090"
inputs = ["internal_metrics"]
type = "prometheus_exporter"
[sources.vector_logs]
type = "internal_logs"
[transforms.vector_logs_index]
inputs = ["vector_logs"]
source = '''
.index_name = "vector"
'''
type = "remap"
- Testing & Validation: Testing and validation were carried out per service to ensure a smooth transition and maintain log integrity. The process included:
- Analyzing Vector Internal Logs: In case of errors, detailed analysis of Vector’s internal logs was performed to identify and resolve issues promptly. This allowed for real-time troubleshooting and ensured that logs were being correctly processed and forwarded.
- Monitoring Metrics in Grafana: All relevant metrics were monitored through Grafana dashboards to track the performance of the Vector agents and overall log flow. Metrics such as log collection rate, processing time, and resource usage were closely observed to ensure that the system met the required performance standards before full deployment.
- Log Integrity Comparison: We compared the integrity of logs from both Fluentd and Vector to ensure there were no discrepancies in the log data, structure, or content during the migration process.
- Deploying in Production: Deployment in production was done gradually to ensure minimal disruption and maintain system stability:
- Gradual Traffic Shift: Traffic was gradually shifted from the legacy log collection system to Vector, allowing for a seamless transition and monitoring of system behavior at each step. This ensured that any potential issues could be identified and resolved before affecting the entire production environment.
- Performance Monitoring: Continuous monitoring of system performance was conducted, focusing on key metrics like log collection speed, processing times, and resource utilization. These metrics were tracked using Grafana to ensure the deployment met performance expectations and to identify any potential bottlenecks.
Cost Savings Analysis
- Reduced Resource Usage: After the migration was completed, the resource usage for log collection significantly decreased:
- Aggregators: With just 3 aggregators, CPU usage averaged around 0.5 cores per aggregator, while memory consumption was about 350MB per aggregator.
- Java Service Logs: The 3 pods dedicated to collecting Java service logs had an average memory usage of 850MB each, with CPU usage comparable to the aggregators.
- Agent Resource Usage: The overall resource usage of the Vector agents was similar to that of FluentBit while offering better performance; the savings came primarily from the aggregator and collector tiers, where Vector’s lower CPU and memory consumption allowed much smaller instances.
- Lower Infrastructure Overhead: Due to Vector's lower CPU and memory consumption, infrastructure overhead was substantially reduced. With just 3 aggregators and 3 pods for Java service logs, the average CPU usage was around 0.5 cores per aggregator, and memory usage was about 350MB per aggregator and 850MB per Java log collector pod. This is significantly more efficient than Fluentd’s resource demands and reduces the need for excess compute resources.
- Lower Cloud Costs: Before the migration, the setup included 9 collectors running on m5.large instances and 5 aggregators on r5.large instances, totaling more than $1,000 per month in cloud costs. After migrating to Vector, the setup was reduced to six t3.small instances. At less than $100 per month, the new setup drastically reduced infrastructure costs, resulting in a significant reduction in overall cloud expenses.
Before:

| Usage | Number of pods | EC2 instance type | On-demand cost ($/month) |
| --- | --- | --- | --- |
| Aggregator | 5 | r5.large | 600 |
| TCP collector | 9 | m5.large | 460 |

After:

| Usage | Number of pods | EC2 instance type | On-demand cost ($/month) |
| --- | --- | --- | --- |
| Aggregator | 3 | t3.small | 46 |
| TCP collector | 3 | t3.small | 46 |
Moving Forward
We were so impressed with the performance and efficiency of Vector that we ended up migrating all systems to it, including EMR clusters and all EC2 instances outside the Kubernetes cluster, such as MongoDB and Cassandra.
Conclusion
Migrating from Fluentd to Vector offers tangible cost savings, improved performance, and a more efficient logging infrastructure. By carefully planning and executing the transition, organizations can optimize their cloud costs while maintaining reliable log collection.