Possible headaches when installing Rancher management nodes on Ubuntu 20.04

A few weeks ago I wrote an article praising the simplicity of installing a Rancher management cluster. I tried it again last week and ran into some issues, some silly, some a bit convoluted. This guide covers the mistakes I made and the issues I ran into.

To begin with, here’s my rancher-cluster.yml:

nodes:
  - address: 192.168.1.11
    user: rancher
    role: [controlplane, worker, etcd]
    ssh_key_path: ~/.ssh/id_rsa
  - address: 192.168.1.12
    user: rancher
    role: [controlplane, worker, etcd]
    ssh_key_path: ~/.ssh/id_rsa
  - address: 192.168.1.13
    user: rancher
    role: [controlplane, worker, etcd]
    ssh_key_path: ~/.ssh/id_rsa

services:
  etcd:
    snapshot: true
    creation: 6h
    retention: 24h


ingress:
  provider: nginx
  options:
    use-forwarded-headers: "true"

Not being able to connect via SSH

Symptom

Rancher says it’s unable to connect to the other nodes.

Example:

Failed to set up SSH tunneling for host [192.168.70.233]: Can't retrieve Docker Info

Possible Resolutions

  1. Make sure that you can connect to every intended K8s node with the SSH key and user defined in the config. In the example above I would test connecting to each node as the user rancher using ~/.ssh/id_rsa. Note that some guides have you create DSA keys, in which case you’ll have to change the key name to id_dsa in the config. It might sound obvious, but the difference is easy to miss.
  2. Rancher connects to Docker via a local socket, tunnelled over SSH. For this to work you need to enable TCP forwarding. This is likely the cause if the output from rke up says the tunnel has been created but that it can’t reach Docker. Fix this by editing /etc/ssh/sshd_config, making sure that the following line is not commented out, and then restarting the SSH service:
    AllowTcpForwarding yes
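Both checks above can be scripted. The following is a minimal sketch, assuming the user and key from my example config. The helper names forwarding_setting and check_node are my own and not part of RKE. Note that forwarding_setting only reads the first uncommented directive in the file and ignores Match blocks, so treat its answer as a hint.

```shell
# Sketch only: SSH_USER, SSH_KEY and the helper names are illustrative,
# taken from the example rancher-cluster.yml above.
SSH_USER=rancher
SSH_KEY="$HOME/.ssh/id_rsa"

# Print the first uncommented AllowTcpForwarding value found in an
# sshd_config-style file, or "unset" if there is none.
forwarding_setting() {
    awk 'tolower($1) == "allowtcpforwarding" { print tolower($2); found = 1; exit }
         END { if (!found) print "unset" }' "$1"
}

# Try a non-interactive login and run "docker info" on one node.
check_node() {
    if ssh -i "$SSH_KEY" -o BatchMode=yes -o ConnectTimeout=5 \
        "$SSH_USER@$1" docker info >/dev/null 2>&1; then
        echo "$1: SSH and Docker OK"
    else
        echo "$1: SSH login or Docker access failed"
    fi
}

# Usage, from the machine where you run rke:
#   for node in 192.168.1.11 192.168.1.12 192.168.1.13; do check_node "$node"; done
# Usage, on each node:
#   forwarding_setting /etc/ssh/sshd_config
```

If you want the value sshd actually uses rather than what one file says, running sudo sshd -T and looking for the allowtcpforwarding line should show the effective setting.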

Old config and network firewall complaints

Symptom

Rancher says that the nodes probably have their firewalls enabled or that there are network issues.

Example:

[network] Host [192.168.x.y] is not able to connect to the following ports: [192.168.x.y:2379]. Please check network policies and firewall rules

Possible Resolutions

  1. This could be caused by an old configuration still being around. Try to clean up the old config by running:
    rke remove --config ./rancher-cluster.yml
  2. Still failing? Try this command to skip the network checks:
    rke up --config ./rancher-cluster.yml --disable-port-check
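Before skipping the port check altogether, it can be worth testing the ports by hand. Here is a rough sketch using bash's /dev/tcp and the coreutils timeout command. The helper names are my own, and apart from 2379 (which comes from the error message above) the ports in the usage line are assumptions based on the usual etcd, API server and kubelet ports.

```shell
# Sketch only: port_open and report_ports are illustrative helper names.
# Return 0 if a TCP connection to host $1, port $2 succeeds within 3 seconds.
port_open() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Print the state of every given port on one host.
report_ports() {
    host=$1
    shift
    for port in "$@"; do
        if port_open "$host" "$port"; then
            echo "$host:$port open"
        else
            echo "$host:$port unreachable"
        fi
    done
}

# Usage, run from one node against another node's address:
#   report_ports 192.168.1.12 2379 2380 6443 10250
```

Keep in mind that a port also shows up as unreachable when nothing is listening on it, so this tells you the most while the cluster containers are actually running.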

Read-only mounts

Symptom

Rancher says that it is denied permission to create various mount source paths.

Example:

error while creating mount source path '/var/lib/etcd': mkdir /var/lib/etcd: permission denied

Possible Resolutions

This is likely caused by conflicting Docker versions being installed. I followed the official installation instructions, but for some reason the snap version of Docker was still installed on all my nodes. Removing the snap version of Docker did the trick:
sudo snap remove docker --purge
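To see which Docker a node is actually using, you can look at where the docker binary lives. This is a small sketch, assuming that snap-installed binaries are exposed under /snap/bin. The helper name docker_source is hypothetical.

```shell
# Sketch only: docker_source is an illustrative helper name.
# Classify a docker binary path: "snap" if it lives under /snap/,
# "missing" if the path is empty, "other" for anything else.
docker_source() {
    case "$1" in
        "")      echo "missing" ;;
        /snap/*) echo "snap" ;;
        *)       echo "other" ;;
    esac
}

# Usage, on each node:
#   docker_source "$(command -v docker)"
#   snap list docker    # lists the package if the snap is installed
```

A result of "other" does not rule out a leftover snap install further down the PATH, which is why the snap list check is worth running as well.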

etcd health check fails

Symptom

Rancher says that the etcd cluster health check failed, and the cluster initialization fails.

Example:

rancher Error response from daemon: error while creating mount source path '/var/lib/etcd': mkdir /var/lib/etcd: read-only file system

Possible Resolution

Make sure that all the nodes are resolvable by hostname via DNS. For example, if node 1 has the hostname ranchermgmt-01.domain.com, the system DNS on each node should be able to resolve ranchermgmt-01.domain.com to the server’s IP address.
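A quick way to test this on each node is getent, which goes through the system resolver. Below is a minimal sketch with helper names of my own. The hostnames other than ranchermgmt-01.domain.com are made up for illustration. Note that getent also consults /etc/hosts, so a hit does not prove that DNS itself has the record.

```shell
# Sketch only: resolves and check_names are illustrative helper names.
# Return 0 if the system resolver can resolve the name.
resolves() {
    getent hosts "$1" >/dev/null
}

# Print the first resolved address for every given hostname.
check_names() {
    for name in "$@"; do
        if resolves "$name"; then
            echo "$name: $(getent hosts "$name" | awk '{ print $1; exit }')"
        else
            echo "$name: does not resolve"
        fi
    done
}

# Usage, on each node (the -02 and -03 names are hypothetical):
#   check_names ranchermgmt-01.domain.com ranchermgmt-02.domain.com ranchermgmt-03.domain.com
```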

Got more?

These are all the issues I ran into. Do you have more, or do you have other possible solutions you wish to share? Let me know and I’ll happily update the guide!
