I’ve been working on an EFK stack for the past couple of days, and figured I’d set up Fluentd on my Swarm to forward logs to the aggregator.
Of course, I didn’t want to install it on the hosts themselves, as I already had Docker up and running; why bother with a manual installation?
The goal was to get my Swarm’s Traefik containers’ logs to Elasticsearch, and Docker has a Fluentd logging driver built in. Perfect! I just need to redirect Traefik’s access logs (which are configured to go to stdout, where Docker picks them up) to a Fluentd container, and off they go.
The catch is that, to do this, Docker needs to be able to reach Fluentd somehow. I did not really want to use a TCP socket: I did not especially need one, and I did not want to spend time securing it against accidental internet exposure.
So I figured: if I’m going to run Fluentd on every node anyway, why not simply expose a Unix socket through a bind mount and call it a day?
… and that’s exactly what this stack file does:
```yaml
version: "3.7"

services:
  fluentd:
    image: fluent/fluentd:v1.10.4-1.0
    user: root
    volumes:
      - type: bind
        source: /var/run/fluentd-docker/
        target: /var/run
    configs:
      - source: fluent-forwarder
        target: /fluentd/etc/fluent.conf
    deploy:
      mode: global
      restart_policy:
        condition: on-failure

configs:
  fluent-forwarder:
    file: ./fluent-forwarder.conf
```
As you can see, I’m forcing the Fluentd container to run as root. This is because I am unsure how to secure the socket properly: I do not know whether changing the ownership of `/var/run/fluentd-docker/` to Fluentd’s default UID is a good idea or not. For the time being, I decided to go with the root solution, since the container is not directly exposed to the internet, and it is easy enough to change later anyway.
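If you would rather avoid running the container as root, one route I have not tested is to hand the host-side socket directory to the UID that the `fluent` user has inside the image, then drop the `user: root` line from the stack file. A rough sketch, with the UID left as a placeholder:

```sh
# Untested alternative to `user: root`: find out which UID the fluent
# user has inside the image...
docker run --rm --entrypoint id fluent/fluentd:v1.10.4-1.0 fluent

# ...then, on each node, give that UID the socket directory:
sudo mkdir -p /var/run/fluentd-docker
sudo chown <fluent-uid> /var/run/fluentd-docker
```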
This has to be coupled with a Fluentd configuration, which I named `fluent-forwarder.conf`. Here’s the template I use for Traefik logging purposes:
```
<source>
  @type unix
  path /var/run/td-agent.sock
</source>

<filter docker.swarm.traefik>
  @type parser
  key_name log
  <parse>
    @type json
    time_type string
  </parse>
</filter>

<match docker.*.*>
  @type forward
  transport tls
  <server>
    host <collector.example.com>
    port <24224>
  </server>
  <security>
    self_hostname <swarm.example.com>
    shared_key <bigkey>
  </security>
</match>
```
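The aggregator itself is out of scope for this post, but for context, the `<security>` block above pairs with a matching `in_forward` source on the collector. A minimal sketch, with the certificate paths, hostname and key as placeholders:

```
<source>
  @type forward
  port 24224
  <transport tls>
    cert_path </path/to/cert.pem>
    private_key_path </path/to/key.pem>
  </transport>
  <security>
    self_hostname <collector.example.com>
    shared_key <bigkey>
  </security>
</source>
```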
To boot up a stack using these, you’ll need to:

- Put the two files in the same directory (`fluentd-stack.yml` and `fluent-forwarder.conf`).
- Edit the Fluentd config according to your environment (particularly the values between `<>`).
- Create the `/var/run/fluentd-docker/` directory manually on your hosts (somewhat annoying, I know; for some reason, Docker doesn’t create bind mount points automatically when they are used on a Swarm).
- Deploy the stack (i.e. `docker stack deploy -c fluentd-stack.yml fluentd` on a manager node), as shown below.
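In shell form, those last two steps look like this:

```sh
# On every node: create the bind mount source (Docker won't create it for you on a Swarm)
sudo mkdir -p /var/run/fluentd-docker

# On a manager node, from the directory containing both files:
docker stack deploy -c fluentd-stack.yml fluentd
```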
Once your stack is up and running, you can check that a new socket is available on each node running Fluentd, at `/var/run/fluentd-docker/td-agent.sock`.
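A quick way to check, on any node:

```sh
# td-agent.sock should be listed, with an `s` (socket) in the mode column
ls -l /var/run/fluentd-docker/
```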
You will then need to tell Docker to redirect the logs of the containers you are interested in to that Unix socket. To do that, all you need to do is add a `logging` section to your stack’s YAML:
```yaml
logging:
  driver: fluentd
  options:
    tag: docker.swarm.traefik
    fluentd-address: unix:///var/run/fluentd-docker/td-agent.sock
    fluentd-async-connect: "true"
```
Change the `tag` to your liking (keeping it consistent with your Fluentd filter configuration), and you’re good to deploy.
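If you want to sanity-check the plumbing before touching a real service, a throwaway container using the same logging options should do; the tag here is made up, but it still matches the `docker.*.*` pattern in the forwarder config:

```sh
# One-off container logging through the same Unix socket
docker run --rm \
  --log-driver fluentd \
  --log-opt fluentd-address=unix:///var/run/fluentd-docker/td-agent.sock \
  --log-opt tag=docker.swarm.test \
  alpine echo "hello fluentd"
```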
For those who may be interested, here’s my current (as of writing) full Traefik stack file:
```yaml
version: "3.7"

services:
  traefik:
    image: traefik:v2.2.1
    environment:
      - "TZ=Europe/Paris"
    command:
      - "--entrypoints.web.address=:80"
      - "--entrypoints.web-secure.address=:443"
      - "--providers.docker.endpoint=tcp://socketproxy:2375"
      - "--certificatesResolvers.le-ssl.acme.tlsChallenge=true"
      - "--certificatesResolvers.le-ssl.acme.email=-snip-"
      - "--certificatesResolvers.le-ssl.acme.storage=/etc/traefik/acme.json"
      - "--log.format=json"
      - "--accesslog=true"
      - "--accesslog.format=json"
      - "--providers.docker.swarmMode=true"
      - "--providers.docker.exposedbydefault=false"
      - "--providers.docker.swarmModeRefreshSeconds=5"
      - "--providers.docker.network=traefik-net"
    volumes:
      - type: volume
        source: traefik-data
        target: /etc/traefik
        volume:
          nocopy: true
    networks:
      - traefik-net
      - socket-private
    ports:
      - target: 80
        published: 80
        protocol: tcp
        mode: host
      - target: 443
        published: 443
        protocol: tcp
        mode: host
    logging:
      driver: fluentd
      options:
        tag: docker.swarm.traefik
        fluentd-address: unix:///var/run/fluentd-docker/td-agent.sock
        fluentd-async-connect: "true"
    deploy:
      mode: global
      update_config:
        parallelism: 1
        delay: 10s
      restart_policy:
        condition: any

  socketproxy:
    image: tecnativa/docker-socket-proxy:latest
    networks:
      - socket-private
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    environment:
      NETWORKS: 1
      SERVICES: 1
      TASKS: 1
    deploy:
      mode: global
      placement:
        constraints:
          - node.role == manager
      restart_policy:
        condition: any

volumes:
  traefik-data:
    driver_opts:
      type: nfs
      o: "-snip-"
      device: ":-snip-"

networks:
  traefik-net:
    external: true
  socket-private:
    driver: overlay
    internal: true
```
I’m using an NFS share for shared storage and a socket proxy so I can run Traefik on each of my nodes, but neither is required for Fluentd, of course.
And that’s it for Fluentd on a Swarm. I may write another post on making an aggregator for Elastic someday. In the meantime, happy logging!