Back to our journey through monitoring and observability! In Part 1, we explored how leveraging logging with tools like Pino, Loki, Promtail, and Grafana can provide valuable insights into our application’s behavior. Now, in Part 2, we’ll take things a step further by introducing distributed tracing with Jaeger.
Distributed tracing is a powerful technique that allows us to follow a request as it traverses through the various components of our system. It helps us understand the flow of data, identify performance bottlenecks, and pinpoint the root cause of issues more effectively. And that’s where Jaeger comes into play.
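Before diving into the tooling, it helps to see what a trace actually is: a tree of spans that all share one trace id, with each span pointing back at its parent. The sketch below is purely illustrative (the `Span` shape and `newSpan` helper are made up for this example, not the OpenTelemetry API):

```typescript
// Illustrative only: a trace is a tree of spans sharing one trace id.
interface Span {
  traceId: string;   // identical for every span in the same request
  spanId: string;    // unique per operation
  parentId?: string; // links child work back to its caller
  name: string;
}

let counter = 0;
const newId = () => (++counter).toString(16).padStart(16, "0");

function newSpan(name: string, parent?: Span): Span {
  return {
    traceId: parent ? parent.traceId : newId().padStart(32, "0"),
    spanId: newId(),
    parentId: parent?.spanId,
    name,
  };
}

// One request touching three components yields three linked spans:
const root = newSpan("GET /orders");
const db = newSpan("db.query", root);
const payment = newSpan("payment.charge", root);
```

Because `db` and `payment` carry the same `traceId` as `root`, a tracing backend can reassemble the whole request tree even when those spans arrive from different processes.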
Jaeger is an open-source, end-to-end distributed tracing system that enables us to monitor and troubleshoot transactions in complex microservices architectures. It originated alongside the OpenTracing specification and today natively supports OpenTelemetry and its OTLP protocol, making it compatible with a wide range of technologies and frameworks.
To demonstrate the setup and usage of Jaeger, we’ll continue working with our fictional e-commerce REST API built with Hono and Bun. As a refresher, our API consists of three endpoints: /products, /carts, and /orders.
Setting up Jaeger
Before we start instrumenting our code, let’s set up Jaeger. The easiest way to get started is with the Jaeger all-in-one Docker image in our Docker Compose file.
version: "3"
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "6831:6831/udp" # Jaeger thrift compact
      - "6832:6832/udp" # Jaeger thrift binary
      - "14250:14250"   # Model/collector gRPC
      - "16686:16686"   # Web UI
      - "4317:4317"     # OTLP gRPC receiver
      - "4318:4318"     # OTLP HTTP receiver
    networks:
      - monitoring-net
    environment:
      - COLLECTOR_OTLP_ENABLED=true
# ...
Jaeger exposes several ports for receiving data, but we are going to use 4318, the OTLP HTTP receiver. We will also extend our Grafana datasource configuration file so we can view the traces directly from Grafana.
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    version: 1
    isDefault: true
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger:16686
    uid: jaeger
    version: 1
    editable: true
    jsonData:
      nodeGraph:
        enabled: true
Instrumenting our code
Now that Jaeger is up and running, let’s instrument our code to start collecting traces. We’ll use the OpenTelemetry JavaScript libraries (opentelemetry-js) to generate and export traces to Jaeger.
First, install the required dependencies:
bun add @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/sdk-trace-base \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/semantic-conventions \
  @opentelemetry/resources
Next, create an instrumentation.ts file to configure and initialize the OpenTelemetry SDK:
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { NodeSDK } from "@opentelemetry/sdk-node";
import { Resource } from "@opentelemetry/resources";
import {
  ATTR_SERVICE_NAME,
  ATTR_SERVICE_VERSION,
  SEMRESATTRS_DEPLOYMENT_ENVIRONMENT,
} from "@opentelemetry/semantic-conventions";
import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-base";

const otlpExporter = new OTLPTraceExporter({
  url: "http://localhost:4318/v1/traces", // OTLP HTTP endpoint
});

// Initialize OpenTelemetry SDK
export const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: "honomart",
    [ATTR_SERVICE_VERSION]: "1.0.0",
    [SEMRESATTRS_DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || "development",
  }),
  spanProcessor: new BatchSpanProcessor(otlpExporter),
});

// Handle shutdown gracefully
process
  .on("SIGTERM", () => {
    sdk
      .shutdown()
      .then(() => console.log("SDK shut down successfully"))
      .catch((error) => console.log("Error shutting down SDK", error))
      .finally(() => process.exit(0));
  })
  .on("SIGINT", () => {
    sdk
      .shutdown()
      .then(() =>
        console.log("Process was interrupted. SDK shut down successfully")
      )
      .catch((error) =>
        console.log("Process was interrupted. Error shutting down SDK", error)
      )
      .finally(() => process.exit(0));
  });

// Start the SDK
sdk.start();
In this configuration, we create a Resource instance carrying the service name, version, and deployment environment as resource attributes. We also set up an OTLPTraceExporter that sends traces to our Jaeger instance over OTLP HTTP and wrap it in a BatchSpanProcessor, so spans are exported in batches rather than one network call at a time.
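Batching is worth a moment: instead of exporting each finished span individually, the BatchSpanProcessor buffers spans and ships them in groups. The following is a toy sketch of that behavior only; the names are invented and the real SDK additionally flushes on a timer and enforces queue limits:

```typescript
// Toy sketch of batch-export behavior (illustrative, not the SDK's code).
type Span = { name: string };

class ToyBatchProcessor {
  private queue: Span[] = [];
  readonly exported: Span[][] = []; // stands in for network calls to the collector

  constructor(private maxBatch = 3) {}

  // Called whenever a span finishes; exports once the batch is full.
  onEnd(span: Span) {
    this.queue.push(span);
    if (this.queue.length >= this.maxBatch) this.flush();
  }

  // Sends whatever is buffered (the SDK also does this on shutdown).
  flush() {
    if (this.queue.length > 0) {
      this.exported.push(this.queue);
      this.queue = [];
    }
  }
}

const p = new ToyBatchProcessor();
["a", "b", "c", "d"].forEach((name) => p.onEnd({ name }));
p.flush(); // drain the leftover span, as a shutdown hook would
```

Four finished spans produce just two "network calls" here: one full batch of three, then the remainder on flush. That is why the graceful-shutdown handlers above matter: without a final flush, buffered spans are lost.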
Now, let’s instrument our /products/:id endpoint to create spans and add attributes:
// Assumes: import { trace, SpanKind, SpanStatusCode } from '@opentelemetry/api';
// and a tracer created once, e.g. const tracer = trace.getTracer('honomart');
.get('/:id', (c) => {
  const log = c.get('logger');
  const parentSpan = c.get('span');
  const id = c.req.param('id');
  const db = c.get('db');
  return tracer.startActiveSpan(
    'products.get',
    {
      kind: SpanKind.SERVER,
      attributes: {
        'products.operation': 'get',
        'product.id': id,
      },
    },
    async (span) => {
      span.addLink({
        context: parentSpan.spanContext(),
        attributes: {
          relationship: 'parent-child',
          'operation.type': 'list-products',
        },
      });
      log.debug({ msg: 'Fetching product by id', productId: id });
      span.addEvent('Searching for product');
      const product = await db.product.findFirst({ where: { id } });
      if (!product) {
        span.setAttributes({
          'products.found': false,
        });
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: 'Product not found',
        });
        span.end(); // spans must be ended explicitly, or they are never exported
        log.warn({ msg: 'Product not found', productId: id });
        return c.json({ message: 'Product not found' }, 404);
      }
      span.setAttributes({
        'products.found': true,
        'product.name': product.name,
      });
      log.info({ msg: 'Product retrieved successfully', productId: id });
      span.setStatus({ code: SpanStatusCode.OK });
      span.end();
      return c.json(product);
    },
  );
})
In this code snippet, we create a new span named products.get using tracer.startActiveSpan. We set attributes like products.operation and product.id to provide more context. We also add a link to the parent span, indicating the relationship between the spans.
Throughout the request handling, we add events and attributes to the span to capture relevant information. If the product is not found, we set an error status on the span. Finally, we set a success status if the product is retrieved successfully.
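In this example the span link is created in-process, but the same trace id also survives hops between services: OpenTelemetry’s HTTP instrumentation propagates it in the W3C traceparent header. Here is a hand-rolled sketch of that header’s format, just to show what travels on the wire; the helper names are invented, and in practice the SDK handles propagation for you:

```typescript
// Sketch of the W3C traceparent header: version-traceId-spanId-flags.
function makeTraceparent(traceId: string, spanId: string, sampled = true): string {
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}

function parseTraceparent(
  header: string
): { traceId: string; spanId: string; sampled: boolean } | null {
  // 32 hex chars of trace id, 16 of span id, 2 of flags
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  return { traceId: m[1], spanId: m[2], sampled: m[3] === "01" };
}

// The caller puts its current trace/span ids in the outgoing request header;
// the callee parses them and continues the same trace.
const header = makeTraceparent("0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331");
const ctx = parseTraceparent(header);
```

This is how Jaeger can stitch spans from separate services into one trace: every downstream span created from `ctx` reuses the incoming trace id.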
Viewing traces in Grafana UI
With our code instrumented, let’s make some requests to our API and observe the resulting traces.
Open Grafana at http://localhost:3030, go to Explore, select the Jaeger data source, and pick the honomart service from the dropdown. You should see a list of traces captured from our API requests.
Click on a trace to view its details. You’ll see a timeline of the spans within the trace, along with their durations and any associated tags and logs. This allows us to visualize the flow of the request through our system and identify any potential bottlenecks or issues.
Conclusion
By incorporating distributed tracing with Jaeger, we’ve gained a whole new level of visibility into our application. We can now trace requests as they propagate through our system, understand the relationships between different services, and pinpoint performance issues with ease.
Jaeger, along with the logging setup we explored in Part 1, forms a powerful combination for monitoring and observability. Together, they provide us with the tools and insights necessary to build and maintain robust, reliable applications.
I hope this deep dive into monitoring and observability has been informative and inspiring. Remember, investing in observability is key to building resilient systems that can withstand the challenges of today’s complex software landscape.